Abstract

Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling 
discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), 
and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that 
under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an 
efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not 
explicitly depend on the number of distinct (discrete) observations — it implicitly depends on this quantity through 
spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large 
number of observations, such as those in natural language processing where the space of observations is sometimes
the words in a language. The algorithm is also simple: it employs only a singular value decomposition and matrix 
multiplications. 



1 Introduction 



Hidden Markov Models (HMMs) [Baum and Eagon, 1967, Rabiner, 1989] are the workhorse statistical model for
discrete time series, with widely diverse applications including automatic speech recognition, natural language
processing (NLP), and genomic sequence modeling. In this model, a discrete hidden state evolves according to some
Markovian dynamics, and the observation at a particular time depends only on the hidden state at that time. The
learning problem is to estimate the model using only observation samples from the underlying distribution. Thus far,
the predominant learning algorithms have been local search heuristics, such as the Baum-Welch / EM algorithm
[Baum et al., 1970, Dempster et al., 1977].



It is not surprising that practical algorithms have resorted to heuristics, as the general learning problem has been
shown to be hard under cryptographic assumptions [Terwijn, 2002]. Fortunately, the hardness results are for HMMs
that seem divorced from those that we are likely to encounter in practical applications. 

The situation is in many ways analogous to learning mixture distributions with samples from the underlying
distribution. There, the general problem is believed to be hard. However, much recent progress has been made when
certain separation assumptions are made with respect to the component mixture distributions (e.g. [Dasgupta, 1999,
Dasgupta and Schulman, 2007, Vempala and Wang, 2002, Chaudhuri and Rao, 2008, Brubaker and Vempala, 2008]).



Roughly speaking, these separation assumptions imply that with high probability, given a point sampled from the 
distribution, we can recover which component distribution generated this point. In fact, there is prevalent sentiment 
that we are often only interested in clustering the data when such a separation condition holds. Much of the theoretical 
work here has been on how small this separation need be in order to permit an efficient algorithm to recover the model. 

We present a simple and efficient algorithm for learning HMMs under a certain natural separation condition. 
We provide two results for learning. The first is that we can approximate the joint distribution over observation 
sequences of length t (here, the quality of approximation is measured by total variation distance). As t increases, 
the approximation quality degrades polynomially. Our second result is on approximating the conditional distribution 
over a future observation, conditioned on some history of observations. We show that this error is asymptotically 
bounded, i.e. for any t, conditioned on the observations prior to time t, the error in predicting the t-th outcome is
controlled. Our algorithm can be thought of as 'improperly' learning an HMM in that we do not explicitly recover the 
transition and observation models. However, our model does maintain a hidden state representation which is closely 
(in fact, linearly) related to the HMM's, and can be used for interpreting the hidden state. 

The separation condition we require is a spectral condition on both the observation matrix and the transition matrix.
Roughly speaking, we require that the observation distributions arising from distinct hidden states be distinct (which 
we formalize by singular value conditions on the observation matrix). This requirement can be thought of as being 
weaker than the separation condition for clustering in that the observation distributions can overlap quite a bit — given 
one observation, we do not necessarily have the information to determine which hidden state it was generated from 
(unlike in the clustering literature). We also have a spectral condition on the correlation between adjacent observations. 
We believe both of these conditions to be quite reasonable in many practical applications. Furthermore, given our 
analysis, extensions to our algorithm which relax these assumptions should be possible. 

The algorithm we present has both polynomial sample and computational complexity. Computationally, the algorithm
is quite simple: at its core is a singular value decomposition (SVD) of a correlation matrix between past and
future observations. This SVD can be viewed as a Canonical Correlation Analysis (CCA) [Hotelling, 1935] between
past and future observations. The sample complexity results we present do not explicitly depend on the number of
distinct observations; rather, they implicitly depend on this number through spectral properties of the HMM. This makes
the algorithm particularly applicable to settings with a large number of observations, such as those in NLP where the 
space of observations is sometimes the words in a language. 



1.1 Related Work 

There are two ideas closely related to this work. The first comes from the subspace identification literature in control
theory [Ljung, 1987, Overschee and Moor, 1996, Katayama, 2005]. The second idea is that, rather than explicitly
modeling the hidden states, we can represent the probabilities of sequences of observations as products of matrix
observation operators, an idea which dates back to the literature on multiplicity automata [Schützenberger, 1961,
Carlyle and Paz, 1971, Fliess, 1974].



The subspace identification methods, used in control theory, use spectral approaches to discover the relationship 
between hidden states and the observations. In this literature, the relationship is discovered for linear dynamical 
systems such as Kalman filters. The basic idea is that the relationship between observations and hidden states can 
often be discovered by spectral/SVD methods correlating the past and future observations (in particular, such methods
often do a CCA between the past and future observations). However, algorithms presented in the literature cannot
be directly used to learn HMMs because they assume additive noise models with noise distributions independent of
the underlying states, and such models are not suitable for HMMs (an exception is [Andersson et al., 2003]). In our
setting, we use this idea of performing a CCA between past and future observations to uncover information about the
observation process (this is done through an SVD on a correlation matrix between past and future observations). The 
state-independent additive noise condition is avoided through the second idea. 

The second idea is that we can represent the probability of sequences as products of matrix operators, as in the
literature on multiplicity automata [Schützenberger, 1961, Carlyle and Paz, 1971, Fliess, 1974] (see [Even-Dar et al.,
2005] for discussion of this relationship). This idea was re-used in both the Observable Operator Model of [Jaeger, 2000]
and the Predictive State Representations of [Littman et al., 2001], both of which are closely related and both of
which can model HMMs. In fact, the former work by [Jaeger, 2000] provides a non-iterative algorithm for learning
HMMs, with an asymptotic analysis. However, this algorithm assumed knowing a set of 'characteristic events', which
is a rather strong assumption that effectively reveals some relationship between the hidden states and observations. In
our algorithm, this problem is avoided through the first idea.



Some of the techniques in the work in [Even-Dar et al., 2007] for tracking belief states in an HMM are used
here. As discussed earlier, we provide a result showing how the model's conditional distributions over observations
(conditioned on a history) do not asymptotically diverge. This result was proven in [Even-Dar et al., 2007] when an
approximate model is already known. Roughly speaking, the reason this error does not diverge is that the previous
observations are always revealing information about the next observation; so with some appropriate contraction property,
we would not expect our errors to diverge. Our work borrows from this contraction analysis.



Among recent efforts in various communities [Andersson et al., 2003, Vanluyten et al., 2007, Zhao and Jaeger,
2007, Cybenko and Crespi, 2008], the only previous efficient algorithm shown to PAC-learn HMMs in a setting similar
to ours is due to [Mossel and Roch, 2006]. Their algorithm for HMMs is a specialization of a more general method
for learning phylogenetic trees from leaf observations. While both this algorithm and ours rely on the same rank
condition and compute similar statistics, they differ in two significant regards. First, [Mossel and Roch, 2006] were
not concerned with large observation spaces, and thus their algorithm assumes the state and observation spaces to
have the same dimension. In addition, [Mossel and Roch, 2006] take the more ambitious approach of learning the
observation and transition matrices explicitly, which unfortunately results in a less stable and less sample-efficient
algorithm that injects noise to artificially spread apart the eigenspectrum of a probability matrix. Our algorithm
avoids recovering the observation and transition matrix explicitly (see footnote 1), and instead uses subspace
identification to learn an alternative representation.



2 Preliminaries 



2.1 Hidden Markov Models 

The HMM defines a probability distribution over sequences of hidden states $(h_t)$ and observations $(x_t)$. We write the
set of hidden states as $[m] = \{1, \ldots, m\}$ and set of observations as $[n] = \{1, \ldots, n\}$, where $m \le n$.

Let $T \in \mathbb{R}^{m \times m}$ be the state transition probability matrix with $T_{ij} = \Pr[h_{t+1} = i \mid h_t = j]$,
$O \in \mathbb{R}^{n \times m}$ be the observation probability matrix with $O_{ij} = \Pr[x_t = i \mid h_t = j]$, and
$\vec{\pi} \in \mathbb{R}^m$ be the initial state distribution with $\vec{\pi}_i = \Pr[h_1 = i]$.
The conditional independence properties that an HMM satisfies are: 1) conditioned on the previous
hidden state, the current hidden state is sampled independently of all other events in the history; and 2) conditioned on 
the current hidden state, the current observation is sampled independently from all other events in the history. These 
conditional independence properties of the HMM imply that T and O fully characterize the probability distribution of 
any sequence of states and observations. 

A useful way of computing the probability of sequences is in terms of 'observation operators', an idea which dates
back to the literature on multiplicity automata (see [Schützenberger, 1961, Carlyle and Paz, 1971, Fliess, 1974]). The
following lemma is straightforward to verify (see [Jaeger, 2000, Even-Dar et al., 2007]).



Lemma 1. For $x = 1, \ldots, n$, define

$$A_x = T \, \mathrm{diag}(O_{x,1}, \ldots, O_{x,m}).$$

For any $t$:

$$\Pr[x_1, \ldots, x_t] = \vec{1}_m^\top A_{x_t} \cdots A_{x_1} \vec{\pi}.$$

Our algorithm learns a representation that is based on this observable operator view of HMMs. 
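
As a minimal sketch of Lemma 1, the snippet below evaluates the operator product on a small hypothetical HMM and compares it with a brute-force sum over hidden paths. The parameter values and the function names (`apply_operator`, `joint_prob`, `brute_force`) are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import product

# Hypothetical 2-state, 3-symbol HMM (values chosen only for illustration).
# Columns index the conditioning state:
#   T[i][j] = Pr[h_{t+1} = i | h_t = j],  O[x][j] = Pr[x_t = x | h_t = j].
T = [[0.7, 0.2],
     [0.3, 0.8]]
O = [[0.5, 0.1],
     [0.3, 0.3],
     [0.2, 0.6]]
pi = [0.6, 0.4]
m, n = 2, 3


def apply_operator(x, v):
    """Return A_x v, where A_x = T diag(O[x][0], ..., O[x][m-1])."""
    scaled = [O[x][j] * v[j] for j in range(m)]
    return [sum(T[i][j] * scaled[j] for j in range(m)) for i in range(m)]


def joint_prob(xs):
    """Pr[x_1, ..., x_t] = 1^T A_{x_t} ... A_{x_1} pi (Lemma 1)."""
    v = list(pi)
    for x in xs:  # A_{x_1} is applied first, A_{x_t} last
        v = apply_operator(x, v)
    return sum(v)


def brute_force(xs):
    """Sum the joint probability over every hidden path h_1, ..., h_t."""
    total = 0.0
    for hs in product(range(m), repeat=len(xs)):
        p = pi[hs[0]] * O[xs[0]][hs[0]]
        for k in range(1, len(xs)):
            p *= T[hs[k]][hs[k - 1]] * O[xs[k]][hs[k]]
        total += p
    return total


sequence = (2, 0, 1)
print(joint_prob(sequence), brute_force(sequence))
```

The operator form costs $O(tm^2)$ arithmetic operations for a length-$t$ sequence, whereas the brute-force sum enumerates all $m^t$ hidden paths.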



2.2 Notation 

As already used in Lemma 1, the vector $\vec{1}_m$ is the all-ones vector in $\mathbb{R}^m$. We denote by $x_{1:t}$ the
sequence $(x_1, \ldots, x_t)$, and by $x_{t:1}$ its reverse $(x_t, \ldots, x_1)$. When we use a sequence as a subscript,
we mean the product of quantities indexed by the sequence elements. So for example, the probability calculation in
Lemma 1 can be written $\vec{1}_m^\top A_{x_{t:1}} \vec{\pi}$. We will use $\vec{h}_t$ to denote a probability vector
(a distribution over hidden states), with the arrow distinguishing it from the random hidden state variable $h_t$.
Additional notation used in the theorem statements and proofs is listed in Table 1.



2.3 Assumptions 

We assume the HMM obeys the following condition. 

Condition 1 (HMM Rank Condition). $\vec{\pi} > 0$ element-wise, and $O$ and $T$ are rank $m$.

The rank condition rules out the problematic case in which some state i has an output distribution equal to a convex 
combination (mixture) of some other states' output distributions. Such a case could cause a learner to confuse state i 
with a mixture of these other states. As mentioned before, the general task of learning HMMs (even the specific goal
of simply accurately modeling the distribution probabilities [Terwijn, 2002]) is hard under cryptographic assumptions;
the rank condition is a natural way to exclude the malicious instances created by the hardness reduction.

1 In Appendix C, we discuss the key step in [Mossel and Roch, 2006], and also show how to use their technique in
conjunction with our algorithm to recover the HMM observation and transition matrices. Our algorithm does not rely on
this extra step (we believe it to be generally unstable), but it can be taken if desired.

The rank condition on O can be relaxed through a simple modification of our algorithm that looks at multiple 
observation symbols simultaneously to form the probability estimation tables. For example, if two hidden states have 
identical observation probability in O but different transition probabilities in T, then they may be differentiated by 
using two consecutive observations. Although our analysis can be applied in this case with minimal modifications, for 
clarity, we only state our results for an algorithm that estimates probability tables with rows and columns corresponding 
to single observations. 

2.4 Learning Model 

Our learning model is similar to those of [Kearns et al., 1994, Mossel and Roch, 2006] for PAC-learning discrete



probability distributions. We assume we can sample observation sequences from an HMM. In particular, we assume 
each sequence is generated starting from the same initial state distribution (e.g. the stationary distribution of the Markov 
chain specified by T). This setting is valid for practical applications including speech recognition, natural language 
processing, and DNA sequence modeling, where multiple independent sequences are available. 

For simplicity, this paper only analyzes an algorithm that uses the initial few observations of each sequence, and 
ignores the rest. We do this to avoid using concentration bounds with complicated mixing conditions for Markov chains 
in our sample complexity calculation, as these conditions are not essential to the main ideas we present. In practice, 
however, one should use the full sequences to form the probability estimation tables required by our algorithm. In such 
scenarios, a single long sequence is sufficient for learning, and the effective sample size can be simply discounted by 
the mixing rate of the underlying Markov chain. 

Our goal is to derive accurate estimators for the cumulative (joint) distribution $\Pr[x_{1:t}]$ and the conditional
distribution $\Pr[x_t \mid x_{1:t-1}]$ for any sequence length $t$. For the conditional distribution, we obtain an
approximation that does not depend on $t$, while for the joint distribution, the approximation quality degrades
gracefully with $t$.



3 Observable Representations of Hidden Markov Models 

A typical strategy for learning HMMs is to estimate the observation and transition probabilities for each hidden state 
(say, by maximizing the likelihood of a sample). However, since the hidden states are not directly observed by 
the learner, one often resorts to heuristics (e.g. EM) that alternate between imputing the hidden states and selecting
parameters O and T that maximize the likelihood of the sample and current state estimates. Such heuristics can suffer 
from local optima issues and require careful initialization (e.g. an accurate guess of the hidden states) to avoid failure. 

However, under Condition 1, HMMs admit an efficiently learnable parameterization that depends only on observable
quantities. Because such quantities can be estimated from data, learning this representation avoids any guesswork
about the hidden states and thus allows for algorithms with strong guarantees of success.

This parameterization is natural in the context of Observable Operator Models [Jaeger, 2000], but here we emphasize
its connection to subspace identification.

3.1 Definition 

Our HMM representation is defined in terms of the following vector and matrix quantities: 

$$[P_1]_i = \Pr[x_1 = i]$$
$$[P_{2,1}]_{ij} = \Pr[x_2 = i, x_1 = j]$$
$$[P_{3,x,1}]_{ij} = \Pr[x_3 = i, x_2 = x, x_1 = j] \quad \forall x \in [n],$$

where $P_1 \in \mathbb{R}^n$ is a vector, and $P_{2,1} \in \mathbb{R}^{n \times n}$ and the $P_{3,x,1} \in \mathbb{R}^{n \times n}$
are matrices. These are the marginal probabilities of observation singletons, pairs, and triples.

The representation further depends on a matrix $U \in \mathbb{R}^{n \times m}$ that obeys the following condition.

Condition 2 (Invertibility Condition). $U^\top O$ is invertible.

In other words, U defines an m-dimensional subspace that preserves the state dynamics — this will become evident 
in the next few lemmas. 

A natural choice for $U$ is given by the 'thin' SVD of $P_{2,1}$, as the next lemma exhibits.

Lemma 2. Assume $\vec{\pi} > 0$ and that $O$ and $T$ have column rank $m$. Then $\mathrm{rank}(P_{2,1}) = m$. Moreover,
if $U$ is the matrix of left singular vectors of $P_{2,1}$ corresponding to non-zero singular values, then
$\mathrm{range}(U) = \mathrm{range}(O)$, so $U \in \mathbb{R}^{n \times m}$ obeys Condition 2.

Proof. Using the conditional independence properties of the HMM, entries of the matrix $P_{2,1}$ can be factored as

$$[P_{2,1}]_{ij} = \sum_{k=1}^{m} \sum_{\ell=1}^{m} \Pr[x_2 = i, x_1 = j, h_2 = k, h_1 = \ell]
= \sum_{k=1}^{m} \sum_{\ell=1}^{m} O_{ik} T_{k\ell} \vec{\pi}_\ell [O^\top]_{\ell j}$$

so $P_{2,1} = O T \mathrm{diag}(\vec{\pi}) O^\top$ and thus $\mathrm{range}(P_{2,1}) \subseteq \mathrm{range}(O)$. The
assumptions on $O$, $T$, and $\vec{\pi}$ imply that $T \mathrm{diag}(\vec{\pi}) O^\top$ has linearly independent rows and
that $P_{2,1}$ has $m$ non-zero singular values. Therefore

$$O = P_{2,1} (T \mathrm{diag}(\vec{\pi}) O^\top)^+$$

(where $X^+$ denotes the Moore-Penrose pseudo-inverse of a matrix $X$), which in turn implies
$\mathrm{range}(O) \subseteq \mathrm{range}(P_{2,1})$. Thus $\mathrm{rank}(P_{2,1}) = \mathrm{rank}(O) = m$, and also
$\mathrm{range}(U) = \mathrm{range}(P_{2,1}) = \mathrm{range}(O)$. □

Our algorithm is motivated by Lemma 2 in that we compute the SVD of an empirical estimate of $P_{2,1}$ to discover
a $U$ that satisfies Condition 2. We also note that this choice for $U$ can be thought of as a surrogate for the
observation matrix $O$ (see Remark 5).
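
The claims of Lemma 2 can be checked numerically on a toy example. The sketch below assumes NumPy and a hypothetical 2-state, 3-symbol HMM whose parameter values are illustrative only. It builds the exact pairs matrix from the factorization in the proof, takes the top-$m$ left singular vectors, and inspects the rank, the range, and the invertibility condition.

```python
import numpy as np

# Hypothetical HMM with m = 2 states and n = 3 observations (illustrative values).
T = np.array([[0.7, 0.2],
              [0.3, 0.8]])   # T[i, j] = Pr[h_{t+1} = i | h_t = j]
O = np.array([[0.5, 0.1],
              [0.3, 0.3],
              [0.2, 0.6]])   # O[x, j] = Pr[x_t = x | h_t = j]
pi = np.array([0.6, 0.4])
m = 2

# Exact pairs matrix via the factorization in the proof: P21 = O T diag(pi) O^T.
P21 = O @ T @ np.diag(pi) @ O.T

# Full SVD; keep the left singular vectors of the m largest singular values.
U_full, singular_values, _ = np.linalg.svd(P21)
U = U_full[:, :m]

rank_P21 = np.linalg.matrix_rank(P21)

# U U^T projects onto range(U); if range(U) = range(O), then U U^T O = O.
projection_residual = np.linalg.norm(U @ U.T @ O - O)

# Condition 2 asks that U^T O be invertible, i.e. have a non-zero determinant.
det_UtO = np.linalg.det(U.T @ O)

print(rank_P21, projection_residual, det_UtO)
```

With exact moments the residual is zero up to floating-point error; with empirical moments, range(U) only approximates range(O).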

Now given such a matrix U, we can finally define the observable representation: 

$$\vec{b}_1 = U^\top P_1$$
$$\vec{b}_\infty = (P_{2,1}^\top U)^+ P_1$$
$$B_x = (U^\top P_{3,x,1}) (U^\top P_{2,1})^+ \quad \forall x \in [n].$$

3.2 Basic Properties 

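The following lemma shows that the observable representation $\{\vec{b}_\infty, \vec{b}_1, B_1, \ldots, B_n\}$ is
sufficient to compute the probabilities of any sequence of observations.

Lemma 3 (Observable HMM Representation). Assume the HMM obeys Condition 1 and that $U \in \mathbb{R}^{n \times m}$ obeys
Condition 2. Then:

1. $\vec{b}_1 = (U^\top O) \vec{\pi}$.

2. $\vec{b}_\infty^\top = \vec{1}_m^\top (U^\top O)^{-1}$.

3. $B_x = (U^\top O) A_x (U^\top O)^{-1}$ $\forall x \in [n]$.

4. $\Pr[x_{1:t}] = \vec{b}_\infty^\top B_{x_{t:1}} \vec{b}_1$ $\forall t \in \mathbb{N}$, $x_1, \ldots, x_t \in [n]$.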
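
The following sketch illustrates Lemma 3 under stated assumptions: NumPy, a hypothetical toy HMM, and exact (noise-free) moments. It forms the observable representation from the singleton, pair, and triple marginals and compares the resulting sequence probabilities with those of the standard parameterization. The helper names `prob_observable` and `prob_hmm` are made up for this example.

```python
import numpy as np
from itertools import product

# Hypothetical HMM (m = 2 states, n = 3 symbols); the numbers are illustrative only.
T = np.array([[0.7, 0.2], [0.3, 0.8]])               # T[i, j] = Pr[h_{t+1}=i | h_t=j]
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])   # O[x, j] = Pr[x_t=x | h_t=j]
pi = np.array([0.6, 0.4])
m, n = 2, 3


def A(x):
    """Observation operator A_x = T diag(O[x, :]) from Lemma 1."""
    return T @ np.diag(O[x])


# Exact singleton, pair, and triple marginals (no sampling noise).
P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ A(x) @ T @ np.diag(pi) @ O.T for x in range(n)]

# U: left singular vectors of P21 for its m non-zero singular values (Lemma 2).
U = np.linalg.svd(P21)[0][:, :m]

# Observable representation from Section 3.1.
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]


def prob_observable(xs):
    """Pr[x_1..x_t] = binf^T B_{x_t} ... B_{x_1} b1 (Lemma 3, claim 4)."""
    v = b1
    for x in xs:
        v = B[x] @ v
    return float(binf @ v)


def prob_hmm(xs):
    """Pr[x_1..x_t] = 1^T A_{x_t} ... A_{x_1} pi (Lemma 1)."""
    v = pi
    for x in xs:
        v = A(x) @ v
    return float(v.sum())


# Largest disagreement over all sequences of length 4.
max_gap = max(abs(prob_observable(xs) - prob_hmm(xs))
              for xs in product(range(n), repeat=4))
print(max_gap)
```

Note that the representation is built only from quantities over observations; $T$, $O$, and $\vec{\pi}$ are used here solely to generate exact moments and to provide the reference probabilities.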

In addition to joint probabilities, we can compute conditional probabilities using the observable representation. We 
do so through (normalized) conditional 'internal states' that depend on a history of observations. We should emphasize 
that these states are not in fact probability distributions over hidden states (though the following lemma shows that
they are linearly related). As per Lemma 3, the initial state is

$$\vec{b}_1 = (U^\top O) \vec{\pi}.$$

Generally, for any $t \ge 1$, given observations $x_{1:t-1}$ with $\Pr[x_{1:t-1}] > 0$, we define the internal state as:

$$\vec{b}_t = \vec{b}_t(x_{1:t-1}) = \frac{B_{x_{t-1:1}} \vec{b}_1}{\vec{b}_\infty^\top B_{x_{t-1:1}} \vec{b}_1}.$$

The case $t = 1$ is consistent with the general definition of $\vec{b}_t$ because the denominator is
$\vec{b}_\infty^\top \vec{b}_1 = \vec{1}_m^\top (U^\top O)^{-1} (U^\top O) \vec{\pi} = \vec{1}_m^\top \vec{\pi} = 1$.
The following result shows how these internal states can be used to compute conditional probabilities
$\Pr[x_t = i \mid x_{1:t-1}]$.

Lemma 4 (Conditional Internal States). Assume the conditions in Lemma 3. Then, for any time $t$:

1. (Recursive update of states) If $\Pr[x_{1:t}] > 0$, then

$$\vec{b}_{t+1} = \frac{B_{x_t} \vec{b}_t}{\vec{b}_\infty^\top B_{x_t} \vec{b}_t}.$$

2. (Relation to hidden states)

$$\vec{b}_t = (U^\top O) \vec{h}_t(x_{1:t-1}),$$

where $[\vec{h}_t(x_{1:t-1})]_i = \Pr[h_t = i \mid x_{1:t-1}]$ is the conditional probability of the hidden state at
time $t$ given the observations $x_{1:t-1}$.

3. (Conditional observation probabilities)

$$\Pr[x_t \mid x_{1:t-1}] = \vec{b}_\infty^\top B_{x_t} \vec{b}_t.$$

Remark 5. If $U$ is the matrix of left singular vectors of $P_{2,1}$ corresponding to non-zero singular values, then $U$
acts much like the observation probability matrix $O$ in the following sense:

Given a conditional state $\vec{b}_t$: $\Pr[x_t = i \mid x_{1:t-1}] = [U \vec{b}_t]_i$.
Given a conditional hidden state $\vec{h}_t$: $\Pr[x_t = i \mid x_{1:t-1}] = [O \vec{h}_t]_i$.

To see this, note that $U U^\top$ is the projection operator to $\mathrm{range}(U)$. Since
$\mathrm{range}(U) = \mathrm{range}(O)$ (Lemma 2), we have $U U^\top O = O$, so
$U \vec{b}_t = U (U^\top O) \vec{h}_t = O \vec{h}_t$.
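
To make Lemma 4 and Remark 5 concrete, here is a small sketch that runs the internal-state recursion next to ordinary Bayesian filtering on the hidden state. It assumes NumPy, exact moments, and a hypothetical toy HMM; the names `internal_state` and `hidden_posterior` are illustrative and not from the paper.

```python
import numpy as np

# Hypothetical HMM (m = 2 states, n = 3 symbols); the numbers are illustrative only.
T = np.array([[0.7, 0.2], [0.3, 0.8]])               # T[i, j] = Pr[h_{t+1}=i | h_t=j]
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])   # O[x, j] = Pr[x_t=x | h_t=j]
pi = np.array([0.6, 0.4])
m, n = 2, 3

# Exact moments and the observable representation of Section 3.1.
M = T @ np.diag(pi) @ O.T
P1, P21 = O @ pi, O @ M
P3x1 = [O @ T @ np.diag(O[x]) @ M for x in range(n)]
U = np.linalg.svd(P21)[0][:, :m]
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]


def internal_state(history):
    """b_t after observing x_1..x_{t-1}, via the recursion in Lemma 4 (claim 1)."""
    b = b1
    for x in history:
        b = B[x] @ b
        b = b / (binf @ b)
    return b


def hidden_posterior(history):
    """Pr[h_t = . | x_1..x_{t-1}] by the usual Bayes filtering recursion."""
    h = pi
    for x in history:
        h = T @ (O[x] * h)   # unnormalized A_x h
        h = h / h.sum()
    return h


history = [2, 0, 1, 1]
b_t = internal_state(history)
h_t = hidden_posterior(history)

pred_from_U = U @ b_t                      # Remark 5: [U b_t]_i
pred_from_O = O @ h_t                      # [O h_t]_i
pred_from_operators = np.array([binf @ B[x] @ b_t for x in range(n)])  # Lemma 4, claim 3
print(pred_from_U, pred_from_O, pred_from_operators)
```

All three predictive vectors should agree up to rounding, even though `b_t` itself is generally not a probability vector.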

3.3 Proofs 

Proof of Lemma 3. The first claim is immediate from the fact $P_1 = O \vec{\pi}$. For the second claim, we write $P_1$ in
the following unusual (but easily verified) form:

$$\begin{aligned}
P_1^\top &= \vec{1}_m^\top T \mathrm{diag}(\vec{\pi}) O^\top \\
&= \vec{1}_m^\top (U^\top O)^{-1} (U^\top O) T \mathrm{diag}(\vec{\pi}) O^\top \\
&= \vec{1}_m^\top (U^\top O)^{-1} U^\top P_{2,1}.
\end{aligned}$$

The matrix $U^\top P_{2,1}$ has linearly independent rows (by the assumptions on $\vec{\pi}$, $O$, $T$, and the condition
on $U$), so

$$\vec{b}_\infty^\top = P_1^\top (U^\top P_{2,1})^+ = \vec{1}_m^\top (U^\top O)^{-1} (U^\top P_{2,1}) (U^\top P_{2,1})^+
= \vec{1}_m^\top (U^\top O)^{-1}.$$

To prove the third claim, we first express $P_{3,x,1}$ in terms of $A_x$:

$$\begin{aligned}
P_{3,x,1} &= O A_x T \mathrm{diag}(\vec{\pi}) O^\top \\
&= O A_x (U^\top O)^{-1} (U^\top O) T \mathrm{diag}(\vec{\pi}) O^\top \\
&= O A_x (U^\top O)^{-1} U^\top P_{2,1}.
\end{aligned}$$

Again, using the fact that $U^\top P_{2,1}$ has full row rank,

$$\begin{aligned}
B_x &= (U^\top P_{3,x,1}) (U^\top P_{2,1})^+ \\
&= (U^\top O) A_x (U^\top O)^{-1} (U^\top P_{2,1}) (U^\top P_{2,1})^+ \\
&= (U^\top O) A_x (U^\top O)^{-1}.
\end{aligned}$$

The probability calculation in the fourth claim is now readily seen as a telescoping product that reduces to the product
in Lemma 1. □

Proof of Lemma 4. The first claim is a simple induction. The second and third claims are also proved by induction as
follows. The base case is clear from Lemma 3 since $\vec{h}_1 = \vec{\pi}$ and $\vec{b}_1 = (U^\top O) \vec{\pi}$, and also
$\vec{b}_\infty^\top B_{x_1} \vec{b}_1 = \vec{1}_m^\top A_{x_1} \vec{\pi} = \Pr[x_1]$. For the inductive step,

$$\begin{aligned}
\vec{b}_{t+1} &= \frac{B_{x_t} \vec{b}_t}{\vec{b}_\infty^\top B_{x_t} \vec{b}_t}
= \frac{B_{x_t} (U^\top O) \vec{h}_t}{\Pr[x_t \mid x_{1:t-1}]}
= \frac{(U^\top O) A_{x_t} \vec{h}_t}{\Pr[x_t \mid x_{1:t-1}]} \\
&= (U^\top O) \frac{\Pr[h_{t+1} = \cdot, x_t \mid x_{1:t-1}]}{\Pr[x_t \mid x_{1:t-1}]}
= (U^\top O) \vec{h}_{t+1}(x_{1:t})
\end{aligned}$$

(the first three equalities follow from the definition in the first claim, the inductive hypothesis, and Lemma 3), and

$$\vec{b}_\infty^\top B_{x_{t+1}} \vec{b}_{t+1} = \vec{1}_m^\top A_{x_{t+1}} \vec{h}_{t+1} = \Pr[x_{t+1} \mid x_{1:t}]$$

(again, using Lemma 3). □



4 Spectral Learning of Hidden Markov Models 
4.1 Algorithm 

The representation in the previous section suggests the algorithm LearnHMM(m, N) detailed in Figure 1, which
simply uses random samples to estimate the model parameters. Note that in practice, knowing m is not essential 
because the method presented here tolerates models that are not exactly HMMs, and the parameter m may be tuned 
using cross-validation. As we discussed earlier, the requirement for independent samples is only for the convenience 
of our sample complexity analysis. 

The model returned by LearnHMM(m, N) can be used as follows:

• To predict the probability of a sequence:

$$\widehat{\Pr}[x_1, \ldots, x_t] = \hat{b}_\infty^\top \hat{B}_{x_t} \cdots \hat{B}_{x_1} \hat{b}_1.$$

• Given an observation $x_t$, the 'internal state' update is

$$\hat{b}_{t+1} = \frac{\hat{B}_{x_t} \hat{b}_t}{\hat{b}_\infty^\top \hat{B}_{x_t} \hat{b}_t}.$$

• To predict the conditional probability of $x_t$ given $x_{1:t-1}$:

$$\widehat{\Pr}[x_t \mid x_{1:t-1}] = \frac{\hat{b}_\infty^\top \hat{B}_{x_t} \hat{b}_t}{\sum_x \hat{b}_\infty^\top \hat{B}_x \hat{b}_t}.$$

Algorithm LearnHMM(m, N):

Inputs: $m$ - number of states, $N$ - sample size

Returns: HMM model parameterized by $\{\hat{b}_1, \hat{b}_\infty, \hat{B}_x \ \forall x \in [n]\}$

1. Independently sample $N$ observation triples $(x_1, x_2, x_3)$ from the HMM to form empirical estimates
$\hat{P}_1, \hat{P}_{2,1}, \hat{P}_{3,x,1} \ \forall x \in [n]$ of $P_1, P_{2,1}, P_{3,x,1} \ \forall x \in [n]$.

2. Compute the SVD of $\hat{P}_{2,1}$, and let $\hat{U}$ be the matrix of left singular vectors corresponding to the $m$
largest singular values.

3. Compute model parameters:

(a) $\hat{b}_1 = \hat{U}^\top \hat{P}_1$,

(b) $\hat{b}_\infty = (\hat{P}_{2,1}^\top \hat{U})^+ \hat{P}_1$,

(c) $\hat{B}_x = \hat{U}^\top \hat{P}_{3,x,1} (\hat{U}^\top \hat{P}_{2,1})^+ \ \forall x \in [n]$.

Figure 1: HMM learning algorithm.

Aside from the random sampling, the running time of the learning algorithm is dominated by the SVD computation
of an $n \times n$ matrix. The time required for computing joint probability calculations is $O(tm^2)$ for length $t$
sequences, the same as if one used the ordinary HMM parameters ($O$ and $T$). For conditional probabilities, we require
some extra
work (proportional to n) to compute the normalization factor. However, our analysis shows that this normalization 
factor is always close to 1 (see LemmaQj]), so it can be safely omitted in many applications. 
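
The steps of the learning algorithm can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a reference implementation: the function names (`sample_categorical`, `sample_triples`, `learn_hmm`, `joint_prob`, `true_prob`), the vectorized sampler, the seed, the sample size, and the toy HMM parameters are all hypothetical choices made for this example.

```python
import numpy as np
from itertools import product


def sample_categorical(P_cols, idx, rng):
    """Draw one sample per entry k of idx from the distribution in column idx[k]."""
    cdf = np.cumsum(P_cols, axis=0)                  # column-wise CDFs
    u = rng.random(len(idx))
    draws = (u[None, :] > cdf[:, idx]).sum(axis=0)   # number of CDF steps below u
    return np.minimum(draws, P_cols.shape[0] - 1)    # guard against round-off in the CDF


def sample_triples(T, O, pi, N, rng):
    """N independent triples (x1, x2, x3), each started from the initial distribution pi."""
    h = sample_categorical(pi[:, None], np.zeros(N, dtype=int), rng)
    xs = []
    for _ in range(3):
        xs.append(sample_categorical(O, h, rng))     # emit from the current state
        h = sample_categorical(T, h, rng)            # then transition
    return np.stack(xs, axis=1)


def learn_hmm(triples, m, n):
    """Steps 1-3 of the algorithm in Figure 1, given an (N, 3) integer array of triples."""
    N = len(triples)
    x1, x2, x3 = triples.T
    P1 = np.bincount(x1, minlength=n) / N
    P21 = np.zeros((n, n))
    np.add.at(P21, (x2, x1), 1.0 / N)                # [P21]_{ij} ~ Pr[x2 = i, x1 = j]
    P3x1 = np.zeros((n, n, n))
    np.add.at(P3x1, (x2, x3, x1), 1.0 / N)           # P3x1[x, i, j] ~ Pr[x3 = i, x2 = x, x1 = j]
    U = np.linalg.svd(P21)[0][:, :m]                 # top-m left singular vectors
    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    pinv_UtP21 = np.linalg.pinv(U.T @ P21)
    B = np.stack([U.T @ P3x1[x] @ pinv_UtP21 for x in range(n)])
    return b1, binf, B


def joint_prob(model, xs):
    """Estimated Pr[x_1..x_t] = binf^T B_{x_t} ... B_{x_1} b1."""
    b1, binf, B = model
    v = b1
    for x in xs:
        v = B[x] @ v
    return float(binf @ v)


# Hypothetical, well-separated toy HMM used only to generate data.
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.8, 0.1], [0.1, 0.1], [0.1, 0.8]])
pi = np.array([0.6, 0.4])


def true_prob(xs):
    """Reference probability from the true parameters (Lemma 1)."""
    v = pi
    for x in xs:
        v = T @ (O[x] * v)
    return float(v.sum())


rng = np.random.default_rng(0)
model = learn_hmm(sample_triples(T, O, pi, 200_000, rng), m=2, n=3)
l1_error = sum(abs(joint_prob(model, xs) - true_prob(xs))
               for xs in product(range(3), repeat=3))
print(l1_error)
```

The learned quantities are only estimates, so individual sequence probabilities may come out slightly negative or fail to sum exactly to one; the $L_1$ comparison above is the quantity the joint-probability guarantee is stated in.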



4.2 Main Results 

We now present our main results. The first result is a guarantee on the accuracy of our joint probability estimates for 
observation sequences. The second result concerns the accuracy of conditional probability estimates — a much more 
delicate quantity to bound due to conditioning on unlikely events. We also remark that if the probability distribution 
is only approximately modeled as an HMM, then our results degrade gracefully based on this approximation quality. 



4.2.1 Joint Probability Accuracy 

Let $\sigma_m(M)$ denote the $m$-th largest singular value of a matrix $M$. Our sample complexity bound will depend
polynomially on $1/\sigma_m(P_{2,1})$ and $1/\sigma_m(O)$.

Also, define

$$\epsilon(k) = \min \Big\{ \sum_{j \in S} \Pr[x_2 = j] : S \subseteq [n], |S| = n - k \Big\}, \tag{1}$$

and let

$$n_0(\epsilon) = \min \{ k : \epsilon(k) \le \epsilon \}.$$

In other words, $n_0(\epsilon)$ is the minimum number of observations that account for about $1 - \epsilon$ of the total
probability mass. Clearly $n_0(\epsilon) \le n$, but it can often be much smaller in real applications. For example, in
many practical applications, the frequencies of observation symbols follow a power law (called Zipf's law) of the form
$f(k) \propto 1/k^s$, where $f(k)$ is the frequency of the $k$-th most frequently observed symbol. If $s > 1$, then
$\epsilon(k) = O(k^{1-s})$, and $n_0(\epsilon) = O(\epsilon^{1/(1-s)})$ becomes independent of the number of observations $n$.
This means that for such problems, our analysis below leads to a sample complexity bound for the cumulative distribution
$\Pr[x_{1:t}]$ that can be independent of $n$. This is useful in domains with large $n$ such as natural language processing.
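
As a rough illustration of the quantity $n_0(\epsilon)$, the sketch below computes it directly from a symbol distribution and evaluates it on Zipf-like distributions of two different sizes. The helper names `n0` and `zipf`, the exponent, and the tolerance are assumptions chosen for the example.

```python
def n0(probs, eps):
    """Smallest k such that the n - k least likely symbols have total mass <= eps."""
    ordered = sorted(probs, reverse=True)   # most frequent symbols first
    tail_mass = sum(ordered)                # eps(0): no symbol removed yet
    for k, p in enumerate(ordered):
        if tail_mass <= eps:                # at this point tail_mass equals eps(k)
            return k
        tail_mass -= p                      # drop the k-th most frequent symbol
    return len(ordered)


def zipf(n, s):
    """Normalized power-law frequencies f(k) proportional to 1 / k**s for k = 1..n."""
    weights = [1.0 / k ** s for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


small_vocab = n0(zipf(10_000, 2.0), 0.01)
large_vocab = n0(zipf(100_000, 2.0), 0.01)
print(small_vocab, large_vocab)
```

For exponent $s = 2$ the two values are nearly identical even though the second vocabulary is ten times larger, which matches the claim that $n_0(\epsilon)$ need not grow with $n$ when $s > 1$.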

Theorem 6. There exists a constant $C > 0$ such that the following holds. Pick any $0 < \epsilon, \eta < 1$ and $t \ge 1$,
and let $\epsilon_0 = \sigma_m(O) \sigma_m(P_{2,1}) \epsilon / (4t\sqrt{m})$. Assume the HMM obeys Condition 1, and

$$N \ge C \cdot \frac{t^2}{\epsilon^2} \cdot \left( \frac{m}{\sigma_m(O)^2 \sigma_m(P_{2,1})^4}
+ \frac{m \cdot n_0(\epsilon_0)}{\sigma_m(O)^2 \sigma_m(P_{2,1})^2} \right) \cdot \log \frac{1}{\eta}.$$

With probability at least $1 - \eta$, the model returned by the algorithm LearnHMM(m, N) satisfies

$$\sum_{x_1, \ldots, x_t} \left| \Pr[x_1, \ldots, x_t] - \widehat{\Pr}[x_1, \ldots, x_t] \right| \le \epsilon.$$

The main challenge in proving Theorem 6 is understanding how the estimation errors accumulate in the algorithm's
probability calculation. This would have been less problematic if we had estimates of the usual HMM parameters T 
and O; the fully observable representation forces us to deal with more cumbersome matrix and vector products. 

4.2.2 Conditional Probability Accuracy 

In this section, we analyze the accuracy of our conditional predictions $\widehat{\Pr}[x_t \mid x_1, \ldots, x_{t-1}]$.
Intuitively, we might hope that these predictive distributions do not become arbitrarily bad over time (as
$t \to \infty$). The reason is that while estimation errors propagate into long-term probability predictions (as
evident in Theorem 6), the history of
observations constantly provides feedback about the underlying hidden state, and this information is incorporated 
using Bayes' rule (implicitly via our internal state updates) . 

This intuition was confirmed by Even-Dar et al. [2007], who showed that if one has an approximate model of $T$
and $O$ for the HMM, then under certain conditions, the conditional prediction does not diverge. This condition is the
positivity of the 'value of observation' $\gamma$, defined as

$$\gamma = \inf_{\vec{v} : \|\vec{v}\|_1 = 1} \|O \vec{v}\|_1.$$

Note that $\gamma \ge \sigma_m(O) / \sqrt{n}$, so it is guaranteed to be positive by Condition 1. However, $\gamma$ can be
much larger than what this crude lower bound suggests.
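
One way to see the crude lower bound on the value of observation is the following short chain of standard norm inequalities. This derivation is a sketch added for illustration (it is not spelled out in the text) and uses only that $O$ has $m$ columns, that its smallest singular value is $\sigma_m(O)$, and that $m \le n$.

```latex
% For any \vec{v} \in \mathbb{R}^m with \|\vec{v}\|_1 = 1:
\begin{align*}
\|O\vec{v}\|_1
  &\ge \|O\vec{v}\|_2
     && \text{($\ell_1$ norm dominates $\ell_2$ norm)} \\
  &\ge \sigma_m(O)\,\|\vec{v}\|_2
     && \text{($\sigma_m(O)$ is the smallest singular value of $O$)} \\
  &\ge \sigma_m(O)\,\frac{\|\vec{v}\|_1}{\sqrt{m}}
     && \text{(Cauchy-Schwarz in $\mathbb{R}^m$)} \\
  &\ge \frac{\sigma_m(O)}{\sqrt{n}}
     && \text{(since $\|\vec{v}\|_1 = 1$ and $m \le n$).}
\end{align*}
```

Taking the infimum over all such vectors gives the stated bound, and the slack in each step suggests why the true value can be considerably larger.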

To interpret this quantity $\gamma$, consider any two distributions over hidden states $\vec{h}, \hat{h} \in \mathbb{R}^m$.
Then $\|O(\vec{h} - \hat{h})\|_1 \ge \gamma \|\vec{h} - \hat{h}\|_1$. Regarding $\vec{h}$ as the true hidden state
distribution and $\hat{h}$ as the estimated hidden state distribution, this inequality gives a lower bound on the error
of the estimated observation distributions under $O$. In other words, the observation process, on average, reveals
errors in our hidden state estimation. The work of Even-Dar et al. [2007] uses
this as a contraction property to show how prediction errors (due to using an approximate model) do not diverge. In 
our setting, this is more difficult as we do not explicitly estimate O nor do we explicitly maintain distributions over 
hidden states. 

We also need the following assumption, which we discuss further following the theorem statement. 

Condition 3 (Stochasticity Condition). For all observations $x$ and all states $i$ and $j$, $[A_x]_{ij} \ge \alpha > 0$.

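Theorem 7. There exists a constant $C > 0$ such that the following holds. Pick any $0 < \epsilon, \eta < 1$, and let
$\epsilon_0 = \sigma_m(O) \sigma_m(P_{2,1}) \epsilon / (4\sqrt{m})$. Assume the HMM obeys Conditions 1 and 3, and

$$N \ge C \cdot \left( \left( \frac{m}{\epsilon^2 \alpha^2} + \frac{(\log(2/\alpha))^4}{\epsilon^4 \alpha^2 \gamma^4} \right)
\cdot \frac{1}{\sigma_m(O)^2 \sigma_m(P_{2,1})^4}
+ \frac{1}{\epsilon^2} \cdot \frac{m \cdot n_0(\epsilon_0)}{\sigma_m(O)^2 \sigma_m(P_{2,1})^2} \right) \cdot \log \frac{1}{\eta}.$$

With probability at least $1 - \eta$, the model returned by LearnHMM(m, N) satisfies, for any time $t$,

$$\mathrm{KL}\big(\Pr[x_t \mid x_1, \ldots, x_{t-1}] \,\|\, \widehat{\Pr}[x_t \mid x_1, \ldots, x_{t-1}]\big)
= \mathbb{E}_{x_{1:t}} \left[ \ln \frac{\Pr[x_t \mid x_{1:t-1}]}{\widehat{\Pr}[x_t \mid x_{1:t-1}]} \right] \le \epsilon.$$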
To justify our choice of error measure, note that the problem of bounding the errors of conditional probabilities is
complicated by the fact that, over the long run, we may have to condition on a very low probability event. Thus
we need to control the relative accuracy of our predictions. This makes the KL-divergence a natural choice for the
error measure. Unfortunately, because our HMM conditions are more naturally interpreted in terms of spectral and
normed quantities, we end up switching back and forth between KL and $L_1$ errors via Pinsker-style inequalities (as in
[Even-Dar et al., 2007]). It is not clear to us if a significantly better guarantee could be obtained with a pure $L_1$
error analysis (nor is it clear how to do such an analysis).

The analysis in [Even-Dar et al., 2007] (which assumed that approximations to $T$ and $O$ were provided) dealt with
this problem of dividing by zero (during a Bayes' rule update) by explicitly modifying the approximate model so that
it never assigns the probability of any event to be zero (since if this event occurred, then the conditional probability is
no longer defined). In our setting, Condition 3 ensures that the true model never assigns the probability of any event to
be zero. We can relax this condition somewhat (so that we need not quantify over all observations), though we do not
discuss this here.

We should also remark that while our sample complexity bound is significantly larger than in Theorem 6, we are
also bounding the more stringent KL-error measure on conditional distributions.

4.2.3 Learning Distributions $\epsilon$-close to HMMs

Our $L_1$ error guarantee for predicting joint probabilities still holds if the sample used to estimate
$\hat{P}_1, \hat{P}_{2,1}, \hat{P}_{3,x,1}$ comes from a probability distribution $\Pr[\cdot]$ that is merely close to an HMM.
Specifically, all we need is that there exists some $t_{\max} \ge 3$ and some $m$-state HMM with distribution
$\Pr^{\mathrm{HMM}}[\cdot]$ such that:

1. $\Pr^{\mathrm{HMM}}$ satisfies Condition 1 (HMM Rank Condition),

2. For all $t \le t_{\max}$, $\sum_{x_{1:t}} |\Pr[x_{1:t}] - \Pr^{\mathrm{HMM}}[x_{1:t}]| \le \epsilon^{\mathrm{HMM}}(t)$,

3. $\epsilon^{\mathrm{HMM}}(2) \ll \frac{1}{2} \sigma_m(P_{2,1}^{\mathrm{HMM}})$.

The resulting error of our learned model $\widehat{\Pr}$ is

$$\sum_{x_{1:t}} |\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]| \le \epsilon^{\mathrm{HMM}}(t)
+ \sum_{x_{1:t}} |\Pr^{\mathrm{HMM}}[x_{1:t}] - \widehat{\Pr}[x_{1:t}]|$$

for all $t \le t_{\max}$. The second term is now bounded as in Theorem 6, with spectral parameters corresponding to
$\Pr^{\mathrm{HMM}}$.

5 Proofs 

Throughout this section, we assume the HMM obeys Condition 1. Table 1 summarizes the notation that will be used throughout the analysis in this section.

5.1 Estimation Errors 

Define the following sampling error quantities: 

$$\epsilon_1 = \|\hat P_1 - P_1\|_2,$$
$$\epsilon_{2,1} = \|\hat P_{2,1} - P_{2,1}\|_2,$$
$$\epsilon_{3,x,1} = \|\hat P_{3,x,1} - P_{3,x,1}\|_2.$$

The following lemma bounds these errors with high probability as a function of the number of observation samples 
used to form the estimates. 
$m$, $n$                                               Number of states and observations
$n_0(\epsilon)$                                        Number of significant observations
$O$, $T$, $A_x$                                        HMM parameters
$P_1$, $P_{2,1}$, $P_{3,x,1}$                          Marginal probabilities
$\hat P_1$, $\hat P_{2,1}$, $\hat P_{3,x,1}$           Empirical marginal probabilities
$\epsilon_1$, $\epsilon_{2,1}$, $\epsilon_{3,x,1}$     Sampling errors [Section 5.1]
$\hat U$                                               Matrix of $m$ left singular vectors of $\hat P_{2,1}$
$\tilde b_\infty$, $\tilde B_x$, $\tilde b_1$          True observable parameters using $\hat U$ [Section 5.1]
$\hat b_\infty$, $\hat B_x$, $\hat b_1$                Estimated observable parameters using $\hat U$
$\delta_\infty$, $\Delta_x$, $\delta_1$                Parameter errors [Section 5.1]
$\Delta$                                               $\sum_x \Delta_x$ [Section 5.1]
$\sigma_m(M)$                                          $m$-th largest singular value of matrix $M$
$b_t$, $\hat b_t$                                      True and estimated states [Section 5.3]
$h_t$, $\hat h_t$, $\hat g_t$                          $(\hat U^\top O)^{-1} b_t$, $(\hat U^\top O)^{-1} \hat b_t$, $\hat h_t / (1_m^\top \hat h_t)$ [Section 5.3]
$\hat A_x$                                             $(\hat U^\top O)^{-1} \hat B_x (\hat U^\top O)$ [Section 5.3]
$\gamma$, $\alpha$                                     $\inf\{\|Ov\|_1 : \|v\|_1 = 1\}$, $\min\{[A_x]_{i,j}\}$

Table 1: Summary of notation.



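Lemma 8. If the algorithm independently samples $N$ observation triples from the HMM, then with probability at least $1 - \eta$:

$$\epsilon_1 \le \sqrt{\frac{1}{N} \ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}},$$
$$\epsilon_{2,1} \le \sqrt{\frac{1}{N} \ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}},$$
$$\max_x \epsilon_{3,x,1} \le \sqrt{\frac{1}{N} \ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}},$$
$$\sum_x \epsilon_{3,x,1} \le \min_k \left( \sqrt{\frac{k}{N} \ln\frac{3}{\eta}} + \sqrt{\frac{k}{N}} + 2\epsilon(k) \right) + \sqrt{\frac{1}{N} \ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}},$$

where $\epsilon(k)$ is defined in (1).

Proof. See Appendix A. □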

The rest of the analysis estimates how the sampling errors affect the accuracies of the model parameters (which in turn affect the prediction quality). We need some results from matrix perturbation theory, which are given in Appendix B.

Let $\hat U \in R^{n \times m}$ be the matrix of left singular vectors of $\hat P_{2,1}$. The first lemma implies that if $\hat P_{2,1}$ is sufficiently close to $P_{2,1}$, i.e. $\epsilon_{2,1}$ is small enough, then the difference between projecting to range($\hat U$) and to range($U$) is small. In particular, $\hat U^\top O$ will be invertible and be nearly as well-conditioned as $U^\top O$.

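Lemma 9. Suppose $\epsilon_{2,1} \le \epsilon \cdot \sigma_m(P_{2,1})$ for some $\epsilon < 1/2$. Let $\epsilon_0 = \epsilon_{2,1}^2 / ((1 - \epsilon)\sigma_m(P_{2,1}))^2$. Then:

1. $\epsilon_0 < 1$,

2. $\sigma_m(\hat U^\top \hat P_{2,1}) \ge (1 - \epsilon)\sigma_m(P_{2,1})$,

3. $\sigma_m(\hat U^\top P_{2,1}) \ge \sqrt{1 - \epsilon_0}\, \sigma_m(P_{2,1})$,

4. $\sigma_m(\hat U^\top O) \ge \sqrt{1 - \epsilon_0}\, \sigma_m(O)$.
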
Proof. The assumptions imply $\epsilon_0 < 1$. Since $\sigma_m(\hat U^\top \hat P_{2,1}) = \sigma_m(\hat P_{2,1})$, the second claim is immediate from Corollary 22. Let $U \in R^{n \times m}$ be the matrix of left singular vectors of $P_{2,1}$. For any $x \in R^m$, $\|\hat U^\top U x\|_2 \ge \|x\|_2 \sqrt{1 - \|\hat U_\perp^\top U\|_2^2} \ge \|x\|_2 \sqrt{1 - \epsilon_0}$ by Corollary 22 and the fact $\epsilon_0 < 1$. The remaining claims follow. □



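Now we will argue that the estimated parameters $\hat b_\infty, \hat B_x, \hat b_1$ are close to the following true parameters from the observable representation when $\hat U$ is used for $U$:

$$\tilde b_\infty = (P_{2,1}^\top \hat U)^+ P_1 = (\hat U^\top O)^{-\top} 1_m,$$
$$\tilde B_x = (\hat U^\top P_{3,x,1})(\hat U^\top P_{2,1})^+ = (\hat U^\top O) A_x (\hat U^\top O)^{-1} \quad \text{for } x = 1, \ldots, n,$$
$$\tilde b_1 = \hat U^\top P_1.$$

By Lemma 3, as long as $\hat U^\top O$ is invertible, these parameters $\tilde b_\infty, \tilde B_x, \tilde b_1$ constitute a valid observable representation for the HMM.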
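The observable representation above can be checked numerically on a toy model. The following is a minimal sketch, assuming NumPy is available and using exact (population) moments in place of empirical ones, so that $U$ is the true singular subspace; the function names and the tiny HMM in the usage note are hypothetical and not part of the paper.

```python
import numpy as np

def observable_representation(T, O, pi):
    """Build (b_1, b_inf, B_x) from exact moments of an HMM.

    Convention (as in the paper): T[i, j] = Pr[h' = i | h = j],
    O[x, j] = Pr[x | h = j], pi = initial state distribution.
    """
    m = T.shape[0]
    P1 = O @ pi                                  # [P1]_i = Pr[x1 = i]
    P21 = O @ T @ np.diag(pi) @ O.T              # [P21]_ij = Pr[x2 = i, x1 = j]
    U = np.linalg.svd(P21)[0][:, :m]             # top-m left singular vectors
    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1
    UP21_pinv = np.linalg.pinv(U.T @ P21)
    B = []
    for x in range(O.shape[0]):
        A_x = T @ np.diag(O[x])                  # A_x = T diag(O_{x,1..m})
        P3x1 = O @ A_x @ T @ np.diag(pi) @ O.T   # Pr[x3 = i, x2 = x, x1 = j]
        B.append(U.T @ P3x1 @ UP21_pinv)
    return b1, binf, B

def joint_probability(seq, b1, binf, B):
    """Pr[x_1..x_t] = b_inf^T B_{x_t} ... B_{x_1} b_1."""
    state = b1
    for x in seq:
        state = B[x] @ state
    return float(binf @ state)

def forward_probability(seq, T, O, pi):
    """Reference value 1^T A_{x_t} ... A_{x_1} pi using the hidden parameters."""
    state = pi
    for x in seq:
        state = T @ (O[x] * state)
    return float(state.sum())
```

For instance, with a two-state, three-observation model such as `T = [[0.7, 0.2], [0.3, 0.8]]`, `O = [[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]]`, `pi = [0.6, 0.4]` (as NumPy arrays), the two probability functions should agree up to floating-point error for any observation sequence.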

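Define the following errors of the estimated parameters:

$$\delta_\infty = \|(\hat U^\top O)^\top (\hat b_\infty - \tilde b_\infty)\|_\infty = \|(\hat U^\top O)^\top \hat b_\infty - 1_m\|_\infty,$$
$$\Delta_x = \|(\hat U^\top O)^{-1} (\hat B_x - \tilde B_x) (\hat U^\top O)\|_1 = \|(\hat U^\top O)^{-1} \hat B_x (\hat U^\top O) - A_x\|_1,$$
$$\Delta = \sum_x \Delta_x,$$
$$\delta_1 = \|(\hat U^\top O)^{-1} (\hat b_1 - \tilde b_1)\|_1 = \|(\hat U^\top O)^{-1} \hat b_1 - \pi\|_1.$$
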

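We can relate these to the sampling errors as follows.

Lemma 10. Assume $\epsilon_{2,1} \le \sigma_m(P_{2,1})/3$. Then:

$$\delta_\infty \le 4 \cdot \left( \frac{\epsilon_{2,1}}{\sigma_m(P_{2,1})^2} + \frac{\epsilon_1}{3\sigma_m(P_{2,1})} \right),$$
$$\Delta_x \le \frac{8}{\sqrt 3} \cdot \frac{\sqrt m}{\sigma_m(O)} \cdot \left( \Pr[x_2 = x] \cdot \frac{\epsilon_{2,1}}{\sigma_m(P_{2,1})^2} + \frac{\epsilon_{3,x,1}}{3\sigma_m(P_{2,1})} \right),$$
$$\Delta \le \frac{8}{\sqrt 3} \cdot \frac{\sqrt m}{\sigma_m(O)} \cdot \left( \frac{\epsilon_{2,1}}{\sigma_m(P_{2,1})^2} + \frac{\sum_x \epsilon_{3,x,1}}{3\sigma_m(P_{2,1})} \right),$$
$$\delta_1 \le \frac{2}{\sqrt 3} \cdot \frac{\sqrt m}{\sigma_m(O)} \cdot \epsilon_1.$$
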

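Proof. The assumption on $\epsilon_{2,1}$ guarantees that $\hat U^\top O$ is invertible (Lemma 9).

We bound $\delta_\infty = \|(O^\top \hat U)(\hat b_\infty - \tilde b_\infty)\|_\infty$ by $\|O^\top\|_\infty \|\hat U(\hat b_\infty - \tilde b_\infty)\|_\infty \le \|\hat b_\infty - \tilde b_\infty\|_2$. Then:

$$\|\hat b_\infty - \tilde b_\infty\|_2 = \|(\hat P_{2,1}^\top \hat U)^+ \hat P_1 - (P_{2,1}^\top \hat U)^+ P_1\|_2$$
$$\le \|((\hat P_{2,1}^\top \hat U)^+ - (P_{2,1}^\top \hat U)^+) \hat P_1\|_2 + \|(P_{2,1}^\top \hat U)^+ (\hat P_1 - P_1)\|_2$$
$$\le \|(\hat P_{2,1}^\top \hat U)^+ - (P_{2,1}^\top \hat U)^+\|_2 \|\hat P_1\|_1 + \|(P_{2,1}^\top \hat U)^+\|_2 \|\hat P_1 - P_1\|_2$$
$$\le \frac{1 + \sqrt 5}{2} \cdot \frac{\epsilon_{2,1}}{\min\{\sigma_m(\hat P_{2,1}), \sigma_m(P_{2,1}^\top \hat U)\}^2} + \frac{\epsilon_1}{\sigma_m(P_{2,1}^\top \hat U)},$$

where the last inequality follows from Lemma 23. The bound now follows from Lemma 9.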

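Next for $\Delta_x$, we bound each term $\|(\hat U^\top O)^{-1}(\hat B_x - \tilde B_x)(\hat U^\top O)\|_1$ by $\sqrt m \|(\hat U^\top O)^{-1}(\hat B_x - \tilde B_x)\hat U^\top\|_2 \|O\|_1 \le \sqrt m \|(\hat U^\top O)^{-1}\|_2 \|\hat B_x - \tilde B_x\|_2 \|\hat U^\top\|_2 \|O\|_1 = \sqrt m \|\hat B_x - \tilde B_x\|_2 / \sigma_m(\hat U^\top O)$. To deal with $\|\hat B_x - \tilde B_x\|_2$, we use the decomposition

$$\|\hat B_x - \tilde B_x\|_2 = \|(\hat U^\top \hat P_{3,x,1})(\hat U^\top \hat P_{2,1})^+ - (\hat U^\top P_{3,x,1})(\hat U^\top P_{2,1})^+\|_2$$
$$\le \|(\hat U^\top P_{3,x,1})((\hat U^\top \hat P_{2,1})^+ - (\hat U^\top P_{2,1})^+)\|_2 + \|\hat U^\top (\hat P_{3,x,1} - P_{3,x,1})(\hat U^\top \hat P_{2,1})^+\|_2$$
$$\le \|P_{3,x,1}\|_2 \cdot \frac{1 + \sqrt 5}{2} \cdot \frac{\epsilon_{2,1}}{\min\{\sigma_m(\hat P_{2,1}), \sigma_m(\hat U^\top P_{2,1})\}^2} + \frac{\epsilon_{3,x,1}}{\sigma_m(\hat U^\top \hat P_{2,1})}$$
$$\le \Pr[x_2 = x] \cdot \frac{1 + \sqrt 5}{2} \cdot \frac{\epsilon_{2,1}}{\min\{\sigma_m(\hat P_{2,1}), \sigma_m(\hat U^\top P_{2,1})\}^2} + \frac{\epsilon_{3,x,1}}{\sigma_m(\hat U^\top \hat P_{2,1})},$$

where the second inequality uses Lemma 23, and the final inequality uses the fact $\|P_{3,x,1}\|_2 \le \sqrt{\sum_{i,j} [P_{3,x,1}]_{i,j}^2} \le \sum_{i,j} [P_{3,x,1}]_{i,j} = \Pr[x_2 = x]$. Applying Lemma 9 gives the stated bound on $\Delta_x$ and also $\Delta$.

Finally, we bound $\delta_1$ by $\sqrt m \|(\hat U^\top O)^{-1} \hat U^\top\|_2 \|\hat P_1 - P_1\|_2 \le \sqrt m\, \epsilon_1 / \sigma_m(\hat U^\top O)$. Again, the stated bound follows from Lemma 9. □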

5.2 Proof of Theorem 6

We need to quantify how estimation errors propagate in the probability calculation. Because the joint probability of 
a length t sequence is computed by multiplying together t matrices, there is a danger of magnifying the estimation 
errors exponentially. Fortunately, this is not the case: the following lemma shows that these errors accumulate roughly 
additively. 

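Lemma 11. Assume $\hat U^\top O$ is invertible. For any time $t$:

$$\sum_{x_{1:t}} \|(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)\|_1 \le (1 + \Delta)^t \delta_1 + (1 + \Delta)^t - 1.$$

Proof. By induction on $t$. The base case, that $\|(\hat U^\top O)^{-1}(\hat b_1 - \tilde b_1)\|_1 \le (1 + \Delta)^0 \delta_1 + (1 + \Delta)^0 - 1 = \delta_1$, is true by definition. For the inductive step, define unnormalized states $\hat b_t = \hat b_t(x_{1:t-1}) = \hat B_{x_{t-1:1}} \hat b_1$ and $\tilde b_t = \tilde b_t(x_{1:t-1}) = \tilde B_{x_{t-1:1}} \tilde b_1$. Fix $t > 1$, and assume

$$\sum_{x_{1:t-1}} \|(\hat U^\top O)^{-1} (\hat b_t - \tilde b_t)\|_1 \le (1 + \Delta)^{t-1} \delta_1 + (1 + \Delta)^{t-1} - 1.$$

Then, we can decompose the sum over $x_{1:t}$ as

$$\sum_{x_{1:t}} \|(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)\|_1$$
$$= \sum_{x_t} \sum_{x_{1:t-1}} \left\| (\hat U^\top O)^{-1} \left( (\hat B_{x_t} - \tilde B_{x_t}) \tilde b_t + (\hat B_{x_t} - \tilde B_{x_t})(\hat b_t - \tilde b_t) + \tilde B_{x_t} (\hat b_t - \tilde b_t) \right) \right\|_1,$$

which, by the triangle inequality, is bounded above by

$$\sum_{x_t} \sum_{x_{1:t-1}} \|(\hat U^\top O)^{-1} (\hat B_{x_t} - \tilde B_{x_t}) (\hat U^\top O)\|_1 \|(\hat U^\top O)^{-1} \tilde b_t\|_1 \qquad (2)$$
$$+ \sum_{x_t} \sum_{x_{1:t-1}} \|(\hat U^\top O)^{-1} (\hat B_{x_t} - \tilde B_{x_t}) (\hat U^\top O)\|_1 \|(\hat U^\top O)^{-1} (\hat b_t - \tilde b_t)\|_1 \qquad (3)$$
$$+ \sum_{x_t} \sum_{x_{1:t-1}} \|(\hat U^\top O)^{-1} \tilde B_{x_t} (\hat U^\top O)(\hat U^\top O)^{-1} (\hat b_t - \tilde b_t)\|_1. \qquad (4)$$
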
We deal with each double sum individually. For the sums in (2), we use the fact that $\|(\hat U^\top O)^{-1} \tilde b_t\|_1 = \Pr[x_{1:t-1}]$, which, when summed over $x_{1:t-1}$, is 1. Thus the entire double sum is bounded by $\Delta$ by definition. For (3), we use the inductive hypothesis to bound the inner sum over $\|(\hat U^\top O)^{-1}(\hat b_t - \tilde b_t)\|_1$; the outer sum scales this bound by $\Delta$ (again, by definition). Thus the double sum is bounded by $\Delta((1 + \Delta)^{t-1} \delta_1 + (1 + \Delta)^{t-1} - 1)$. Finally, for sums in (4), we first replace $(\hat U^\top O)^{-1} \tilde B_{x_t} (\hat U^\top O)$ with $A_{x_t}$. Since $A_{x_t}$ has all non-negative entries, we have that $\|A_{x_t} v\|_1 \le 1_m^\top A_{x_t} |v|$ for any vector $v \in R^m$, where $|v|$ denotes the element-wise absolute value of $v$. Now the fact $\sum_{x_t} 1_m^\top A_{x_t} |v| = 1_m^\top T |v| = 1_m^\top |v| = \|v\|_1$ and the inductive hypothesis imply the double sum in (4) is bounded by $(1 + \Delta)^{t-1} \delta_1 + (1 + \Delta)^{t-1} - 1$. Combining these bounds for (2), (3), and (4) completes the induction. □
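The non-expansion fact used for the sums in (4) can be illustrated on a small example. The sketch below uses a hypothetical two-state, three-observation HMM (the numbers are arbitrary assumptions) and checks that the operators $A_x = T\,\mathrm{diag}(O_{x,1}, \ldots, O_{x,m})$ jointly preserve the $L_1$ norm of non-negative vectors and do not increase it for signed vectors.

```python
# Hypothetical toy HMM; columns of T and of O each sum to one.
T = [[0.7, 0.2],
     [0.3, 0.8]]            # T[i][j] = Pr[h' = i | h = j]
O = [[0.5, 0.1],
     [0.3, 0.3],
     [0.2, 0.6]]            # O[x][j] = Pr[x | h = j]

def apply_A(x, v):
    """Return A_x v, where A_x = T diag(O[x])."""
    scaled = [O[x][j] * v[j] for j in range(len(v))]
    return [sum(T[i][j] * scaled[j] for j in range(len(v)))
            for i in range(len(T))]

def l1(v):
    """L1 norm of a vector."""
    return sum(abs(c) for c in v)

if __name__ == "__main__":
    v = [0.3, -1.2]                       # signed vector, as in the error terms
    abs_v = [abs(c) for c in v]
    mass = sum(sum(apply_A(x, abs_v)) for x in range(len(O)))
    total = sum(l1(apply_A(x, v)) for x in range(len(O)))
    print("sum_x 1^T A_x |v| =", mass, " ||v||_1 =", l1(v))
    print("sum_x ||A_x v||_1 =", total)
```

This is why the third double sum contributes no multiplicative blow-up: applying the true operators and summing over the next observation cannot increase the accumulated $L_1$ error.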

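All that remains is to bound the effect of errors in $\hat b_\infty$. Theorem 6 will follow from the following lemma combined with the sampling error bounds of Lemma 8.

Lemma 12. Assume $\epsilon_{2,1} \le \sigma_m(P_{2,1})/3$. Then for any $t$,

$$\sum_{x_{1:t}} |\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]| \le \delta_\infty + (1 + \delta_\infty)\left( (1 + \Delta)^t \delta_1 + (1 + \Delta)^t - 1 \right).$$

Proof. By Lemma 9 and the condition on $\epsilon_{2,1}$, we have $\sigma_m(\hat U^\top O) \ge (\sqrt 3 / 2) \sigma_m(O) > 0$, so $\hat U^\top O$ is invertible.

Now we can decompose the $L_1$ error as follows:

$$\sum_{x_{1:t}} |\widehat{\Pr}[x_{1:t}] - \Pr[x_{1:t}]| = \sum_{x_{1:t}} |\hat b_\infty^\top \hat B_{x_{t:1}} \hat b_1 - \tilde b_\infty^\top \tilde B_{x_{t:1}} \tilde b_1|$$
$$\le \sum_{x_{1:t}} |(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)(\hat U^\top O)^{-1} \tilde B_{x_{t:1}} \tilde b_1| \qquad (5)$$
$$+ \sum_{x_{1:t}} |(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)| \qquad (6)$$
$$+ \sum_{x_{1:t}} |\tilde b_\infty^\top (\hat U^\top O)(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)|. \qquad (7)$$

The first sum (5) is

$$\sum_{x_{1:t}} |(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)(\hat U^\top O)^{-1} \tilde B_{x_{t:1}} \tilde b_1| \le \sum_{x_{1:t}} \|(\hat U^\top O)^\top (\hat b_\infty - \tilde b_\infty)\|_\infty \|(\hat U^\top O)^{-1} \tilde B_{x_{t:1}} \tilde b_1\|_1$$
$$\le \sum_{x_{1:t}} \delta_\infty \|A_{x_{t:1}} \pi\|_1 = \sum_{x_{1:t}} \delta_\infty \Pr[x_{1:t}] = \delta_\infty,$$

where the first inequality is Hölder's, and the second uses the bounds in Lemma 10.

The second sum (6) employs Hölder's and Lemma 11:

$$\sum_{x_{1:t}} |(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)|$$
$$\le \sum_{x_{1:t}} \|(\hat U^\top O)^\top (\hat b_\infty - \tilde b_\infty)\|_\infty \|(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)\|_1$$
$$\le \delta_\infty \left( (1 + \Delta)^t \delta_1 + (1 + \Delta)^t - 1 \right).$$

Finally, the third sum (7) uses Lemma 11:

$$\sum_{x_{1:t}} |\tilde b_\infty^\top (\hat U^\top O)(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)| = \sum_{x_{1:t}} |1_m^\top (\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)|$$
$$\le \sum_{x_{1:t}} \|(\hat U^\top O)^{-1} (\hat B_{x_{t:1}} \hat b_1 - \tilde B_{x_{t:1}} \tilde b_1)\|_1 \le (1 + \Delta)^t \delta_1 + (1 + \Delta)^t - 1.$$

Combining these gives the desired bound. □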






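Proof of Theorem 6. By Lemma 8, the specified number of samples $N$ (with a suitable constant $C$), together with the setting of $\epsilon$ in $n_0(\epsilon)$, guarantees the following sampling error bounds:

$$\epsilon_1 \le \min\left( 0.05 \cdot (3/8) \cdot \sigma_m(P_{2,1}) \cdot \epsilon,\; 0.05 \cdot (\sqrt 3 / 2) \cdot \sigma_m(O) \cdot (1/\sqrt m) \cdot \epsilon \right),$$
$$\epsilon_{2,1} \le \min\left( 0.05 \cdot (1/8) \cdot \sigma_m(P_{2,1})^2 \cdot (\epsilon/5),\; 0.01 \cdot (\sqrt 3 / 8) \cdot \sigma_m(O) \cdot \sigma_m(P_{2,1})^2 \cdot (1/(t\sqrt m)) \cdot \epsilon \right),$$
$$\sum_x \epsilon_{3,x,1} \le 0.39 \cdot (3\sqrt 3 / 8) \cdot \sigma_m(O) \cdot \sigma_m(P_{2,1}) \cdot (1/(t\sqrt m)) \cdot \epsilon.$$

These, in turn, imply the following parameter error bounds, via Lemma 10: $\delta_\infty \le 0.05\epsilon$, $\delta_1 \le 0.05\epsilon$, and $\Delta \le 0.4\epsilon/t$. Finally, Lemma 12 and the fact $(1 + a/t)^t \le 1 + 2a$ for $a \le 1/2$ imply the desired $L_1$ error bound of $\epsilon$. □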
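The last step of this proof is pure arithmetic, and it can be sanity-checked directly. The sketch below plugs the parameter error bounds $\delta_\infty = \delta_1 = 0.05\epsilon$ and $\Delta = 0.4\epsilon/t$ into the right-hand side of Lemma 12; the function names are hypothetical and only the constants come from the text.

```python
def lemma12_bound(delta_inf, delta_1, Delta, t):
    """Right-hand side of Lemma 12:
    delta_inf + (1 + delta_inf) * ((1 + Delta)^t * delta_1 + (1 + Delta)^t - 1)."""
    growth = (1.0 + Delta) ** t
    return delta_inf + (1.0 + delta_inf) * (growth * delta_1 + growth - 1.0)

def worst_case_l1_error(eps, t):
    """Lemma 12 bound when the parameter errors sit exactly at the
    thresholds used in the proof of Theorem 6."""
    return lemma12_bound(0.05 * eps, 0.05 * eps, 0.4 * eps / t, t)

if __name__ == "__main__":
    for eps in (0.01, 0.1, 0.5, 1.0):
        for t in (1, 10, 1000):
            bound = worst_case_l1_error(eps, t)
            print(f"eps={eps:<5} t={t:<5} bound={bound:.4f} ok={bound <= eps}")
```

The check also makes visible why $\Delta$ must scale as $1/t$: the factor $(1 + \Delta)^t$ stays bounded only if $t\Delta$ does.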



5.3 Proof of Theorem 7

In this subsection, we assume the HMM obeys Condition 3 (in addition to Condition 1).

We introduce the following notation. Let the unnormalized estimated conditional hidden state distributions be

$$\hat h_t = (\hat U^\top O)^{-1} \hat b_t,$$

and its normalized version,

$$\hat g_t = \hat h_t / (1_m^\top \hat h_t).$$

Also, let

$$\hat A_x = (\hat U^\top O)^{-1} \hat B_x (\hat U^\top O).$$

This notation lets us succinctly compare the updates made by our estimated model to the updates of the true model. Our algorithm never explicitly computes these hidden state distributions $\hat g_t$ (as it would require knowledge of the unobserved $O$). However, under certain conditions (namely Conditions 1 and 3 and some estimation accuracy requirements), these distributions are well-defined and thus we use them for the sake of analysis.

The following lemma shows that if the estimated parameters are accurate, then the state updates behave much like 
the true hidden state updates. 

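Lemma 13. For any probability vector $w \in R^m$ and any observation $x$,

$$\left| \sum_x \hat b_\infty^\top (\hat U^\top O) \hat A_x w - 1 \right| \le \delta_\infty + \delta_\infty \Delta + \Delta \quad \text{and}$$
$$\frac{[\hat A_x w]_i}{\hat b_\infty^\top (\hat U^\top O) \hat A_x w} \ge \frac{[A_x w]_i - \Delta_x}{1_m^\top A_x w + \delta_\infty + \delta_\infty \Delta_x + \Delta_x} \quad \text{for all } i = 1, \ldots, m.$$

Moreover, for any non-zero vector $w \in R^m$,

$$\frac{1_m^\top \hat A_x w}{\hat b_\infty^\top (\hat U^\top O) \hat A_x w} \le \frac{1}{1 - \delta_\infty}.$$

Proof. We need to relate the effect of the estimated operator $\hat A_x$ to that of the true operator $A_x$. First assume $w$ is a probability vector. Then:

$$|\hat b_\infty^\top (\hat U^\top O) \hat A_x w - 1_m^\top A_x w| = |(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O) A_x w + (\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)(\hat A_x - A_x) w + \tilde b_\infty^\top (\hat U^\top O)(\hat A_x - A_x) w|$$
$$\le \|(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)\|_\infty \|A_x w\|_1 + \|(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)\|_\infty \|\hat A_x - A_x\|_1 \|w\|_1 + \|(\hat A_x - A_x) w\|_1.$$

Therefore we have

$$\left| \sum_x \hat b_\infty^\top (\hat U^\top O) \hat A_x w - 1 \right| \le \delta_\infty + \delta_\infty \Delta + \Delta \quad \text{and} \quad \hat b_\infty^\top (\hat U^\top O) \hat A_x w \le 1_m^\top A_x w + \delta_\infty + \delta_\infty \Delta_x + \Delta_x.$$

Combining these inequalities with

$$[\hat A_x w]_i = [A_x w]_i + [(\hat A_x - A_x) w]_i \ge [A_x w]_i - \|(\hat A_x - A_x) w\|_1 \ge [A_x w]_i - \|\hat A_x - A_x\|_1 \|w\|_1 \ge [A_x w]_i - \Delta_x$$

gives the first claim.

Now drop the assumption that $w$ is a probability vector, and assume $1_m^\top \hat A_x w \ne 0$ without loss of generality. Then:

$$\frac{1_m^\top \hat A_x w}{\hat b_\infty^\top (\hat U^\top O) \hat A_x w} = \frac{1_m^\top \hat A_x w}{1_m^\top \hat A_x w + (\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O) \hat A_x w} \le \frac{\|\hat A_x w\|_1}{\|\hat A_x w\|_1 - \|(\hat U^\top O)^\top (\hat b_\infty - \tilde b_\infty)\|_\infty \|\hat A_x w\|_1},$$

which is at most $1/(1 - \delta_\infty)$ as claimed. □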

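A consequence of Lemma 13 is that if the estimated parameters are sufficiently accurate, then the state updates never allow predictions of very small hidden state probabilities.

Corollary 14. Assume $\delta_\infty \le 1/2$, $\max_x \Delta_x \le \alpha/3$, $\delta_1 \le \alpha/8$, and $\max_x \delta_\infty + \delta_\infty \Delta_x + \Delta_x \le 1/3$. Then $[\hat g_t]_i \ge \alpha/2$ for all $t$ and $i$.

Proof. For $t = 1$, we use Lemma 10 to get $\|h_1 - \hat h_1\|_1 \le \delta_1 \le 1/2$, so Lemma 17 implies that $\|h_1 - \hat g_1\|_1 \le 4\delta_1$. Then $[\hat g_1]_i \ge [h_1]_i - |[h_1]_i - [\hat g_1]_i| \ge \alpha - 4\delta_1 \ge \alpha/2$ (using Condition 3) as needed. For $t > 1$, Lemma 13 implies

$$\frac{[\hat A_x \hat g_{t-1}]_i}{\hat b_\infty^\top (\hat U^\top O) \hat A_x \hat g_{t-1}} \ge \frac{[A_x \hat g_{t-1}]_i - \Delta_x}{1_m^\top A_x \hat g_{t-1} + \delta_\infty + \delta_\infty \Delta_x + \Delta_x} \ge \frac{\alpha - \alpha/3}{1 + 1/3} \ge \frac{\alpha}{2},$$

using Condition 3 in the second-to-last step. □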

Lemma 13 and Corollary 14 can now be used to prove the contraction property of the KL-divergence between the true hidden states and the estimated hidden states. The analysis shares ideas from [Even-Dar et al., 2007], though the added difficulty is due to the fact that the state maintained by our algorithm is not a probability distribution.

Lemma 15. Let $\epsilon_0 = \max_x 2\Delta_x/\alpha + (\delta_\infty + \delta_\infty \Delta_x + \Delta_x)/\alpha + 2\delta_\infty$. Assume $\delta_\infty \le 1/2$, $\max_x \Delta_x \le \alpha/3$, and $\max_x \delta_\infty + \delta_\infty \Delta_x + \Delta_x \le 1/3$. For all $t$, if $\hat g_t \in R^m$ is a probability vector, then

$$KL(h_{t+1} \| \hat g_{t+1}) \le KL(h_t \| \hat g_t) - \frac{\gamma^2}{2 (\ln\frac{2}{\alpha})^2} KL(h_t \| \hat g_t)^2 + \epsilon_0.$$




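Proof. The LHS, written as an expectation over $x_{1:t}$, is

$$KL(h_{t+1} \| \hat g_{t+1}) = E_{x_{1:t}}\left[ \sum_{i=1}^m [h_{t+1}]_i \ln \frac{[h_{t+1}]_i}{[\hat g_{t+1}]_i} \right].$$

We can bound $\ln(1/[\hat g_{t+1}]_i)$ as

$$\ln \frac{1}{[\hat g_{t+1}]_i} = \ln\left( \frac{\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t} \hat g_t}{[\hat A_{x_t} \hat g_t]_i} \cdot \frac{1_m^\top \hat A_{x_t} \hat g_t}{\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t} \hat g_t} \right)$$
$$\le \ln\left( \frac{1_m^\top A_{x_t} \hat g_t + \delta_\infty + \delta_\infty \Delta_{x_t} + \Delta_{x_t}}{[A_{x_t} \hat g_t]_i - \Delta_{x_t}} \cdot \frac{1}{1 - \delta_\infty} \right)$$
$$\le \ln\left( \frac{1_m^\top A_{x_t} \hat g_t}{[A_{x_t} \hat g_t]_i} \cdot \left( 1 + \frac{\delta_\infty + \delta_\infty \Delta_{x_t} + \Delta_{x_t}}{\alpha} \right) \left( 1 + \frac{2\Delta_{x_t}}{\alpha} \right) (1 + 2\delta_\infty) \right)$$
$$\le \ln \frac{1_m^\top A_{x_t} \hat g_t}{[A_{x_t} \hat g_t]_i} + \frac{\delta_\infty + \delta_\infty \Delta_{x_t} + \Delta_{x_t}}{\alpha} + \frac{2\Delta_{x_t}}{\alpha} + 2\delta_\infty \le \ln \frac{1_m^\top A_{x_t} \hat g_t}{[A_{x_t} \hat g_t]_i} + \epsilon_0,$$

where the first inequality follows from Lemma 13, the second uses $1/(1 - a) \le 1 + 2a$ for $a \le 1/2$ together with $[A_{x_t} \hat g_t]_i \ge \alpha$, and the third uses $\ln(1 + a) \le a$. Therefore,

$$KL(h_{t+1} \| \hat g_{t+1}) \le E_{x_{1:t}}\left[ \sum_{i=1}^m [h_{t+1}]_i \ln\left( [h_{t+1}]_i \cdot \frac{1_m^\top A_{x_t} \hat g_t}{[A_{x_t} \hat g_t]_i} \right) \right] + \epsilon_0. \qquad (8)$$
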



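The expectation in (8) is the KL-divergence between $\Pr[h_{t+1} | x_{1:t}]$ and the distribution over $h_{t+1}$ that is arrived at by updating $\widehat{\Pr}[h_t | x_{1:t-1}]$ (using Bayes' rule) with $\Pr[h_{t+1} | h_t]$ and $\Pr[x_t | h_t]$. Call this second distribution $\widetilde{\Pr}[h_{t+1} | x_{1:t}]$. The chain rule for KL-divergence states

$$KL(\Pr[h_{t+1} | x_{1:t}] \| \widetilde{\Pr}[h_{t+1} | x_{1:t}]) + KL(\Pr[h_t | h_{t+1}, x_{1:t}] \| \widetilde{\Pr}[h_t | h_{t+1}, x_{1:t}])$$
$$= KL(\Pr[h_t | x_{1:t}] \| \widetilde{\Pr}[h_t | x_{1:t}]) + KL(\Pr[h_{t+1} | h_t, x_{1:t}] \| \widetilde{\Pr}[h_{t+1} | h_t, x_{1:t}]).$$

Thus, using the non-negativity of KL-divergence, we have

$$KL(\Pr[h_{t+1} | x_{1:t}] \| \widetilde{\Pr}[h_{t+1} | x_{1:t}]) \le KL(\Pr[h_t | x_{1:t}] \| \widetilde{\Pr}[h_t | x_{1:t}]) + KL(\Pr[h_{t+1} | h_t, x_{1:t}] \| \widetilde{\Pr}[h_{t+1} | h_t, x_{1:t}])$$
$$= KL(\Pr[h_t | x_{1:t}] \| \widetilde{\Pr}[h_t | x_{1:t}]),$$

where the equality follows from the fact that $\widetilde{\Pr}[h_{t+1} | h_t, x_{1:t}] = \widetilde{\Pr}[h_{t+1} | h_t] = \Pr[h_{t+1} | h_t] = \Pr[h_{t+1} | h_t, x_{1:t}]$. Furthermore,

$$\Pr[h_t = i | x_{1:t}] = \Pr[h_t = i | x_{1:t-1}] \cdot \frac{\Pr[x_t | h_t = i]}{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \Pr[h_t = j | x_{1:t-1}]}$$

and

$$\widetilde{\Pr}[h_t = i | x_{1:t}] = \widehat{\Pr}[h_t = i | x_{1:t-1}] \cdot \frac{\Pr[x_t | h_t = i]}{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \widehat{\Pr}[h_t = j | x_{1:t-1}]},$$

so

$$KL(\Pr[h_t | x_{1:t}] \| \widetilde{\Pr}[h_t | x_{1:t}]) = E_{x_{1:t}}\left[ \sum_{i=1}^m \Pr[h_t = i | x_{1:t}] \ln \frac{\Pr[h_t = i | x_{1:t-1}]}{\widehat{\Pr}[h_t = i | x_{1:t-1}]} \right]$$
$$- E_{x_{1:t}}\left[ \sum_{i=1}^m \Pr[h_t = i | x_{1:t}] \ln \frac{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \Pr[h_t = j | x_{1:t-1}]}{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \widehat{\Pr}[h_t = j | x_{1:t-1}]} \right].$$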
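The first expectation is

$$E_{x_{1:t}}\left[ \sum_{i=1}^m \Pr[h_t = i | x_{1:t}] \ln \frac{\Pr[h_t = i | x_{1:t-1}]}{\widehat{\Pr}[h_t = i | x_{1:t-1}]} \right]$$
$$= E_{x_{1:t-1}}\left[ \sum_{x_t} \Pr[x_t | x_{1:t-1}] \sum_{i=1}^m \Pr[h_t = i | x_{1:t}] \ln \frac{\Pr[h_t = i | x_{1:t-1}]}{\widehat{\Pr}[h_t = i | x_{1:t-1}]} \right]$$
$$= E_{x_{1:t-1}}\left[ \sum_{x_t} \sum_{i=1}^m \Pr[x_t | h_t = i] \cdot \Pr[h_t = i | x_{1:t-1}] \ln \frac{\Pr[h_t = i | x_{1:t-1}]}{\widehat{\Pr}[h_t = i | x_{1:t-1}]} \right]$$
$$= E_{x_{1:t-1}}\left[ \sum_{i=1}^m \Pr[h_t = i | x_{1:t-1}] \ln \frac{\Pr[h_t = i | x_{1:t-1}]}{\widehat{\Pr}[h_t = i | x_{1:t-1}]} \right] = KL(h_t \| \hat g_t),$$

and the second expectation is

$$E_{x_{1:t}}\left[ \sum_{i=1}^m \Pr[h_t = i | x_{1:t}] \ln \frac{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \Pr[h_t = j | x_{1:t-1}]}{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \widehat{\Pr}[h_t = j | x_{1:t-1}]} \right]$$
$$= E_{x_{1:t-1}}\left[ \sum_{x_t} \Pr[x_t | x_{1:t-1}] \ln \frac{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \Pr[h_t = j | x_{1:t-1}]}{\sum_{j=1}^m \Pr[x_t | h_t = j] \cdot \widehat{\Pr}[h_t = j | x_{1:t-1}]} \right] = KL(O h_t \| O \hat g_t).$$

Substituting these back into (8), we have

$$KL(h_{t+1} \| \hat g_{t+1}) \le KL(h_t \| \hat g_t) - KL(O h_t \| O \hat g_t) + \epsilon_0.$$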

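It remains to bound $KL(O h_t \| O \hat g_t)$ from below. We use Pinsker's inequality [Cover and Thomas, 1991], which states that for any distributions $p$ and $q$,

$$KL(p \| q) \ge \frac{1}{2} \|p - q\|_1^2,$$

together with the definition of $\gamma$, to deduce

$$KL(O h_t \| O \hat g_t) \ge \frac{1}{2} E_{x_{1:t-1}} \|O h_t - O \hat g_t\|_1^2 \ge \frac{\gamma^2}{2} E_{x_{1:t-1}} \|h_t - \hat g_t\|_1^2.$$

Finally, by Jensen's inequality and Lemma 18 (the latter applies because of Corollary 14), we have that

$$E_{x_{1:t-1}} \|h_t - \hat g_t\|_1^2 \ge \left( E_{x_{1:t-1}} \|h_t - \hat g_t\|_1 \right)^2 \ge \left( \frac{1}{\ln\frac{2}{\alpha}} KL(h_t \| \hat g_t) \right)^2,$$

which gives the required bound. □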



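Finally, the recurrence from Lemma 15 easily gives the following lemma, which in turn combines with the sampling error bounds of Lemma 8 to give Theorem 7.

Lemma 16. Let $\epsilon_0 = \max_x 2\Delta_x/\alpha + (\delta_\infty + \delta_\infty \Delta_x + \Delta_x)/\alpha + 2\delta_\infty$ and $\epsilon_1 = \max_x (\delta_\infty + \sqrt m\, \delta_\infty \Delta_x + \sqrt m\, \Delta_x)/\alpha$. Assume $\delta_\infty \le 1/2$, $\max_x \Delta_x \le \alpha/3$, $\max_x \delta_\infty + \delta_\infty \Delta_x + \Delta_x \le 1/3$, $\delta_1 \le \ln(2/\alpha)/(8\gamma^2)$, $\epsilon_0 \le \ln(2/\alpha)^2/(4\gamma^2)$, and $\epsilon_1 \le 1/2$. Then for all $t$,

$$KL(h_t \| \hat g_t) \le \max\left( 4\delta_1 \log(2/\alpha),\; \sqrt{\frac{2 \ln(2/\alpha)^2 \epsilon_0}{\gamma^2}} \right)$$

and

$$KL(\Pr[x_t | x_{1:t-1}] \| \widehat{\Pr}[x_t | x_{1:t-1}]) \le KL(h_t \| \hat g_t) + \delta_\infty + \delta_\infty \Delta + \Delta + 2\epsilon_1.$$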
Proof. To prove the bound on $KL(h_t \| \hat g_t)$, we proceed by induction on $t$. For the base case, Lemmas 18 (with Corollary 14) and 17 imply $KL(h_1 \| \hat g_1) \le \|h_1 - \hat g_1\|_1 \ln(2/\alpha) \le 4\delta_1 \ln(2/\alpha)$ as required. The inductive step follows easily from Lemma 15 and simple calculus: assuming $c_2 \le 1/(4c_1)$, $z - c_1 z^2 + c_2$ is non-decreasing in $z$ for all $z \le \sqrt{c_2/c_1}$, so $z' \le z - c_1 z^2 + c_2$ and $z \le \sqrt{c_2/c_1}$ together imply that $z' \le \sqrt{c_2/c_1}$. The inductive step uses the above fact with $z = KL(h_t \| \hat g_t)$, $z' = KL(h_{t+1} \| \hat g_{t+1})$, $c_1 = \gamma^2/(2(\ln(2/\alpha))^2)$, and $c_2 = \max(\epsilon_0, c_1 (4\delta_1 \log(2/\alpha))^2)$.
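The calculus fact behind the inductive step can be illustrated by iterating the recurrence in its worst case, where the inequality $z' \le z - c_1 z^2 + c_2$ holds with equality. The sketch below is only an illustration; the values of $c_1$ and $c_2$ in the usage note are arbitrary assumptions satisfying $c_2 \le 1/(4c_1)$.

```python
import math

def iterate_recurrence(z0, c1, c2, steps):
    """Iterate z <- z - c1 * z^2 + c2 (worst case of the Lemma 15 recurrence)
    and return the whole trajectory, starting with z0."""
    z = z0
    trajectory = [z]
    for _ in range(steps):
        z = z - c1 * z * z + c2
        trajectory.append(z)
    return trajectory

def fixed_point_cap(c1, c2):
    """The invariant upper bound sqrt(c2 / c1) from the proof of Lemma 16."""
    return math.sqrt(c2 / c1)

if __name__ == "__main__":
    c1, c2 = 0.5, 0.1                      # assumed values with c2 <= 1 / (4 c1)
    cap = fixed_point_cap(c1, c2)
    for z0 in (0.0, 0.1, cap):
        traj = iterate_recurrence(z0, c1, c2, 50)
        print(f"z0={z0:.3f} max={max(traj):.6f} cap={cap:.6f}")
```

Starting anywhere at or below the cap, the trajectory never exceeds it, which is exactly the invariant the induction maintains for $KL(h_t \| \hat g_t)$.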

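Now we prove the bound on $KL(\Pr[x_t | x_{1:t-1}] \| \widehat{\Pr}[x_t | x_{1:t-1}])$. First, let $\widehat{\Pr}[x_t, h_t | x_{1:t-1}]$ denote our predicted conditional probability of both the hidden state and observation, i.e. the product of the following two quantities:

$$\widehat{\Pr}[h_t = i | x_{1:t-1}] = [\hat g_t]_i \quad \text{and} \quad \widehat{\Pr}[x_t | h_t = i, x_{1:t-1}] = \frac{[\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t}]_i}{\sum_x \hat b_\infty^\top (\hat U^\top O) \hat A_x \hat g_t}.$$

Now we can apply the chain rule for KL-divergence:

$$KL(\Pr[x_t | x_{1:t-1}] \| \widehat{\Pr}[x_t | x_{1:t-1}])$$
$$\le KL(\Pr[h_t | x_{1:t-1}] \| \widehat{\Pr}[h_t | x_{1:t-1}]) + KL(\Pr[x_t | h_t, x_{1:t-1}] \| \widehat{\Pr}[x_t | h_t, x_{1:t-1}])$$
$$= KL(h_t \| \hat g_t) + E_{x_{1:t-1}}\left[ \sum_{i=1}^m \sum_{x_t} [h_t]_i O_{x_t,i} \ln\left( O_{x_t,i} \cdot \frac{\sum_x \hat b_\infty^\top (\hat U^\top O) \hat A_x \hat g_t}{[\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t}]_i} \right) \right]$$
$$\le KL(h_t \| \hat g_t) + E_{x_{1:t-1}}\left[ \sum_{i=1}^m \sum_{x_t} [h_t]_i O_{x_t,i} \ln \frac{O_{x_t,i}}{[\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t}]_i} \right] + \ln(1 + \delta_\infty + \delta_\infty \Delta + \Delta),$$

where the last inequality uses Lemma 13. It will suffice to show that $O_{x_t,i} / [\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t}]_i \le 1 + 2\epsilon_1$. Note that $O_{x_t,i} = [\tilde b_\infty^\top (\hat U^\top O) A_{x_t}]_i \ge \alpha$ by Condition 3. Furthermore, for any $i$,

$$|[\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t}]_i - O_{x_t,i}| \le \|\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t} - \tilde b_\infty^\top (\hat U^\top O) A_{x_t}\|_\infty$$
$$\le \|(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)\|_\infty \|A_{x_t}\|_\infty + \|(\hat b_\infty - \tilde b_\infty)^\top (\hat U^\top O)\|_\infty \|\hat A_{x_t} - A_{x_t}\|_\infty + \|\tilde b_\infty^\top (\hat U^\top O)\|_\infty \|\hat A_{x_t} - A_{x_t}\|_\infty$$
$$\le \delta_\infty + \sqrt m\, \delta_\infty \Delta_{x_t} + \sqrt m\, \Delta_{x_t}.$$

Therefore

$$\frac{O_{x_t,i}}{[\hat b_\infty^\top (\hat U^\top O) \hat A_{x_t}]_i} \le \frac{O_{x_t,i}}{O_{x_t,i} - (\delta_\infty + \sqrt m\, \delta_\infty \Delta_{x_t} + \sqrt m\, \Delta_{x_t})} \le \frac{1}{1 - (\delta_\infty + \sqrt m\, \delta_\infty \Delta_{x_t} + \sqrt m\, \Delta_{x_t})/\alpha} \le \frac{1}{1 - \epsilon_1} \le 1 + 2\epsilon_1$$

as needed. □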



Proof of Theorem 7. The proof is mostly the same as that of Theorem 6 with $t = 1$, except that Lemma 16 introduces additional error terms. Specifically, we require

$$N \ge C \cdot \frac{\ln(2/\alpha)^4}{\epsilon^4 \alpha^2 \gamma^4} \cdot \frac{m}{\sigma_m(O)^2 \sigma_m(P_{2,1})^4} \quad \text{and} \quad N \ge C \cdot \frac{m}{\epsilon^2 \alpha^2} \cdot \frac{m}{\sigma_m(O)^2 \sigma_m(P_{2,1})^4}$$

so that the terms

$$\max\left( 4\delta_1 \log(2/\alpha),\; \sqrt{\frac{2 \ln(2/\alpha)^2 \epsilon_0}{\gamma^2}} \right) \quad \text{and} \quad \epsilon_1,$$

respectively, are $O(\epsilon)$. The specified number of samples $N$ also suffices to imply the preconditions of Lemma 16. The remaining terms are bounded as in the proof of Theorem 6. □
Lemma 17. If $\|a - b\|_1 \le c \le 1/2$ and $b$ is a probability vector, then $\|a/(1^\top a) - b\|_1 \le 4c$.

Proof. First, it is easy to check that $1 - c \le 1^\top a \le 1 + c$. Let $I = \{i : a_i/(1^\top a) > b_i\}$. Then for $i \in I$, $|a_i/(1^\top a) - b_i| = a_i/(1^\top a) - b_i \le a_i/(1 - c) - b_i \le (1 + 2c) a_i - b_i \le |a_i - b_i| + 2c\, a_i$. Similarly, for $i \notin I$, $|b_i - a_i/(1^\top a)| = b_i - a_i/(1^\top a) \le b_i - a_i/(1 + c) \le b_i - (1 - c) a_i \le |b_i - a_i| + c\, a_i$. Therefore $\|a/(1^\top a) - b\|_1 \le \|a - b\|_1 + 2c(1^\top a) \le c + 2c(1 + c) \le 4c$. □

Lemma 18. Let $a$ and $b$ be probability vectors. If there exists some $c < 1/2$ such that $b_i > c$ for all $i$, then $KL(a \| b) \le \|a - b\|_1 \log(1/c)$.

Proof. See [Even-Dar et al., 2007], Lemma 3.10. □
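Both auxiliary lemmas are easy to probe numerically. The following sketch draws random vectors and evaluates the two sides of each inequality; the helper names are hypothetical, and the perturbation model (entrywise rescaling of $b$, which keeps $a$ non-negative) is an assumption made for the illustration rather than something required by the statements.

```python
import math
import random

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(a):
    """Return a / (1^T a)."""
    total = sum(a)
    return [x / total for x in a]

def kl_divergence(a, b):
    """KL(a || b) with natural logarithm; terms with a_i = 0 contribute 0."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def random_probability_vector(rng, d, floor=0.0):
    """Random probability vector whose entries are all at least `floor`."""
    w = [rng.random() for _ in range(d)]
    s = sum(w)
    return [floor + (1.0 - d * floor) * x / s for x in w]

def lemma17_sides(rng, d, scale):
    """Return (||a/(1^T a) - b||_1, 4c) with c = ||a - b||_1 <= scale."""
    b = random_probability_vector(rng, d)
    a = [x * (1.0 + rng.uniform(-scale, scale)) for x in b]
    c = l1_distance(a, b)
    return l1_distance(normalize(a), b), 4.0 * c

def lemma18_sides(rng, d, floor):
    """Return (KL(a || b), ||a - b||_1 * log(1/c)) with c just below min_i b_i."""
    b = random_probability_vector(rng, d, floor)
    a = random_probability_vector(rng, d)
    c = 0.9 * min(b)
    return kl_divergence(a, b), l1_distance(a, b) * math.log(1.0 / c)

if __name__ == "__main__":
    rng = random.Random(0)
    print(lemma17_sides(rng, 5, 0.5))
    print(lemma18_sides(rng, 4, 0.05))
```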
A Sample Complexity Bound

We will assume independent samples to avoid mixing estimation. Otherwise, one can discount the number of samples by one minus the second eigenvalue of the hidden state transition matrix $T$.

We are bounding the Frobenius norm of the matrix errors. For simplicity, we unroll the matrices into vectors, and use vector notation.

Let $z$ be a discrete random variable that takes values in $\{1, \ldots, d\}$. We are interested in estimating the vector $q = [\Pr(z = j)]_{j=1}^d$ from $N$ i.i.d. copies $z_i$ of $z$ ($i = 1, \ldots, N$). Let $q_i$ be the vector of zeros except for the $z_i$-th component, which is one. Then the empirical estimate of $q$ is $\hat q = \sum_{i=1}^N q_i / N$. We are interested in bounding the quantity $\|\hat q - q\|_2$.

The following concentration bound is a simple application of McDiarmid's inequality [McDiarmid, 1989].
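Proposition 19. We have $\forall \epsilon > 0$:

$$\Pr\left( \|\hat q - q\|_2 \ge 1/\sqrt N + \epsilon \right) \le e^{-N\epsilon^2}.$$

Proof. Consider $\hat q = \sum_{i=1}^N q_i / N$, and let $\hat p = \sum_{i=1}^N p_i / N$, where $p_i = q_i$ except for $i = k$. Then we have $\|\hat q - q\|_2 - \|\hat p - q\|_2 \le \|\hat q - \hat p\|_2 \le \sqrt 2 / N$. By McDiarmid's inequality, we have

$$\Pr\left( \|\hat q - q\|_2 \ge E\|\hat q - q\|_2 + \epsilon \right) \le e^{-N\epsilon^2}.$$

Note that

$$E\left\| \sum_{i=1}^N (q_i - q) \right\|_2 \le \left( E\left\| \sum_{i=1}^N (q_i - q) \right\|_2^2 \right)^{1/2} = \left( \sum_{i=1}^N E\|q_i - q\|_2^2 \right)^{1/2} = \left( \sum_{i=1}^N E\left[ 1 - 2 q_i^\top q + \|q\|_2^2 \right] \right)^{1/2} = \sqrt{N(1 - \|q\|_2^2)}.$$

This leads to the desired bound. □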
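Proposition 19 can be compared against simulation. The following Monte Carlo sketch estimates how often $\|\hat q - q\|_2$ exceeds $1/\sqrt N + \epsilon$ for a fixed distribution; the distribution, sample size, and trial count in the usage note are arbitrary assumptions chosen only for illustration.

```python
import math
import random

def empirical_l2_error(q, n_samples, rng):
    """Draw n_samples i.i.d. values from q and return ||q_hat - q||_2."""
    d = len(q)
    counts = [0] * d
    for z in rng.choices(range(d), weights=q, k=n_samples):
        counts[z] += 1
    return math.sqrt(sum((c / n_samples - p) ** 2 for c, p in zip(counts, q)))

def exceed_frequency(q, n_samples, eps, trials, seed=0):
    """Fraction of trials in which ||q_hat - q||_2 >= 1/sqrt(N) + eps."""
    rng = random.Random(seed)
    threshold = 1.0 / math.sqrt(n_samples) + eps
    hits = sum(empirical_l2_error(q, n_samples, rng) >= threshold
               for _ in range(trials))
    return hits / trials

def proposition19_bound(n_samples, eps):
    """The tail bound exp(-N eps^2)."""
    return math.exp(-n_samples * eps * eps)

if __name__ == "__main__":
    q = [0.4, 0.3, 0.2, 0.1]
    N, eps = 200, 0.1
    print("empirical frequency:", exceed_frequency(q, N, eps, trials=500))
    print("bound:", proposition19_bound(N, eps))
```

In runs of this kind the empirical frequency is typically far below the bound, which is expected: the bound does not depend on the dimension $d$ and is not tight for any particular $q$.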

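Using this bound, we obtain with probability $1 - 3\eta$:

$$\epsilon_1 \le \sqrt{\ln(1/\eta)/N} + \sqrt{1/N},$$
$$\epsilon_{2,1} \le \|\hat P_{2,1} - P_{2,1}\|_F \le \sqrt{\ln(1/\eta)/N} + \sqrt{1/N},$$
$$\max_x \epsilon_{3,x,1} \le \sqrt{\sum_x \epsilon_{3,x,1}^2} \le \left( \sum_x \|\hat P_{3,x,1} - P_{3,x,1}\|_F^2 \right)^{1/2} \le \sqrt{\ln(1/\eta)/N} + \sqrt{1/N}.$$

If the observation dimensionality $n$ is large and the sample size $N$ is small, then the third inequality can be improved by considering a more detailed estimate. Given any $k$, let $\epsilon(k)$ be the sum of elements in the smallest $n - k$ probabilities $\Pr[x_2 = x] = \sum_{i,j} [P_{3,x,1}]_{ij}$ (Equation 1). Let $S_k$ be the set of these $n - k$ such $x$. By Proposition 19, we obtain:

$$\left( \sum_{x \notin S_k} \|\hat P_{3,x,1} - P_{3,x,1}\|_F^2 \right)^{1/2} + \left| \sum_{x \in S_k} \sum_{i,j} \left( [\hat P_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) \right| \le \sqrt{\ln(1/\eta)/N} + \sqrt{1/N}.$$

Moreover, by the definition of $S_k$, we have

$$\sum_{x \in S_k} \|\hat P_{3,x,1} - P_{3,x,1}\|_F \le \sum_{x \in S_k} \sum_{i,j} \left| [\hat P_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right|$$
$$= \sum_{x \in S_k} \sum_{i,j} \max\left( 0, [\hat P_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) - \sum_{x \in S_k} \sum_{i,j} \min\left( 0, [\hat P_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right)$$
$$\le \sum_{x \in S_k} \sum_{i,j} \max\left( 0, [\hat P_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) + \epsilon(k)$$
$$\le \left| \sum_{x \in S_k} \sum_{i,j} \left( [\hat P_{3,x,1}]_{ij} - [P_{3,x,1}]_{ij} \right) \right| + 2\epsilon(k).$$

Therefore

$$\sum_x \epsilon_{3,x,1} \le \min_k \left( \sqrt{k \ln(1/\eta)/N} + \sqrt{k/N} + \sqrt{\ln(1/\eta)/N} + \sqrt{1/N} + 2\epsilon(k) \right).$$

This means $\sum_x \epsilon_{3,x,1}$ may be small even if $n$ is large, as long as the number of frequently occurring symbols is small.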
B Matrix Perturbation Theory

The following perturbation bounds can be found in [Stewart and Sun, 1990].

Lemma 20 (Theorem 4.11, p. 204 in [Stewart and Sun, 1990]). Let $A \in R^{m \times n}$ with $m \ge n$, and let $\tilde A = A + E$. If the singular values of $A$ and $\tilde A$ are $(\sigma_1 \ge \ldots \ge \sigma_n)$ and $(\tilde\sigma_1 \ge \ldots \ge \tilde\sigma_n)$, respectively, then

$$|\tilde\sigma_i - \sigma_i| \le \|E\|_2, \quad i = 1, \ldots, n.$$

Lemma 21 (Theorem 4.4, p. 262 in [Stewart and Sun, 1990]). Let $A \in R^{m \times n}$, with $m \ge n$, with the singular value decomposition $(U_1, U_2, U_3, \Sigma_1, \Sigma_2, V_1, V_2)$:

$$\begin{bmatrix} U_1^\top \\ U_2^\top \\ U_3^\top \end{bmatrix} A \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{bmatrix}.$$

Let $\tilde A = A + E$, with analogous SVD $(\tilde U_1, \tilde U_2, \tilde U_3, \tilde\Sigma_1, \tilde\Sigma_2, \tilde V_1, \tilde V_2)$. Let $\Phi$ be the matrix of canonical angles between range($U_1$) and range($\tilde U_1$), and $\Theta$ be the matrix of canonical angles between range($V_1$) and range($\tilde V_1$). If there exists $\delta, \alpha > 0$ such that $\min \sigma(\tilde\Sigma_1) \ge \alpha + \delta$ and $\max \sigma(\Sigma_2) \le \alpha$, then

$$\max\{\|\sin\Phi\|_2, \|\sin\Theta\|_2\} \le \frac{\|E\|_2}{\delta}.$$

Corollary 22. Let $A \in R^{m \times n}$, with $m \ge n$, have rank $n$, and let $U \in R^{m \times n}$ be the matrix of $n$ left singular vectors corresponding to the non-zero singular values $\sigma_1 \ge \ldots \ge \sigma_n > 0$ of $A$. Let $\tilde A = A + E$. Let $\tilde U \in R^{m \times n}$ be the matrix of $n$ left singular vectors corresponding to the largest $n$ singular values $\tilde\sigma_1 \ge \ldots \ge \tilde\sigma_n$ of $\tilde A$, and let $\tilde U_\perp \in R^{m \times (m-n)}$ be the remaining left singular vectors. Assume $\|E\|_2 \le \epsilon \sigma_n$ for some $\epsilon < 1$. Then:

1. $\tilde\sigma_n \ge (1 - \epsilon)\sigma_n$,

2. $\|\tilde U_\perp^\top U\|_2 \le \|E\|_2 / \tilde\sigma_n$.

Proof. The first claim follows from Lemma 20, and the second follows from Lemma 21 because the singular values of $\tilde U_\perp^\top U$ are the sines of the canonical angles between range($U$) and range($\tilde U$). □

Lemma 23 (Theorem 3.8, p. 143 in [Stewart and Sun, 1990]). Let $A \in R^{m \times n}$, with $m \ge n$, and let $\tilde A = A + E$. Then

$$\|\tilde A^+ - A^+\|_2 \le \frac{1 + \sqrt 5}{2} \cdot \max\{\|A^+\|_2^2, \|\tilde A^+\|_2^2\}\, \|E\|_2.$$
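The singular value bound (Lemma 20) and the pseudoinverse bound (Lemma 23) can be evaluated on random matrices. The sketch below assumes NumPy is available; the function name and the matrix sizes in the usage note are hypothetical choices for illustration.

```python
import numpy as np

def check_perturbation_bounds(A, E):
    """Return (weyl_gap, norm_E, pinv_gap, pinv_bound) for A_tilde = A + E.

    weyl_gap   = max_i |sigma_i(A_tilde) - sigma_i(A)|     (Lemma 20, left side)
    norm_E     = ||E||_2                                    (Lemma 20, right side)
    pinv_gap   = ||A_tilde^+ - A^+||_2                      (Lemma 23, left side)
    pinv_bound = (1 + sqrt 5)/2 * max(||A^+||, ||A_tilde^+||)^2 * ||E||_2
    """
    A_tilde = A + E
    norm_E = np.linalg.norm(E, 2)
    s = np.linalg.svd(A, compute_uv=False)
    s_tilde = np.linalg.svd(A_tilde, compute_uv=False)
    weyl_gap = float(np.max(np.abs(s_tilde - s)))
    A_pinv = np.linalg.pinv(A)
    At_pinv = np.linalg.pinv(A_tilde)
    pinv_gap = float(np.linalg.norm(At_pinv - A_pinv, 2))
    largest = max(np.linalg.norm(A_pinv, 2), np.linalg.norm(At_pinv, 2))
    pinv_bound = float((1.0 + np.sqrt(5.0)) / 2.0 * largest ** 2 * norm_E)
    return weyl_gap, float(norm_E), pinv_gap, pinv_bound

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((6, 3))
    E = 0.01 * rng.standard_normal((6, 3))
    print(check_perturbation_bounds(A, E))
```

In the analysis these bounds are applied with $A = P_{2,1}$ (or a projection of it) and $E$ equal to the sampling error, which is how $\epsilon_{2,1}$ enters Lemmas 9 and 10.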



C Recovering the Observation and Transition Matrices

We sketch how to use the technique of [Mossel and Roch, 2006] to recover the observation and transition matrices explicitly. This is an extra step that can be used in conjunction with our algorithm.

Define the $n \times n$ matrix $[P_{3,1}]_{i,j} = \Pr[x_3 = i, x_1 = j]$. Let $O_x = \mathrm{diag}(O_{x,1}, \ldots, O_{x,m})$, so $A_x = T O_x$. Since $P_{3,x,1} = O A_x T\, \mathrm{diag}(\pi) O^\top$, we have $P_{3,1} = \sum_x P_{3,x,1} = O T T\, \mathrm{diag}(\pi) O^\top$. Therefore

$$U^\top P_{3,x,1} = U^\top O T O_x T\, \mathrm{diag}(\pi) O^\top$$
$$= (U^\top O T) O_x (U^\top O T)^{-1} (U^\top O T) T\, \mathrm{diag}(\pi) O^\top$$
$$= (U^\top O T) O_x (U^\top O T)^{-1} (U^\top P_{3,1}).$$

The matrix $U^\top P_{3,1}$ has full row rank, so $(U^\top P_{3,1})(U^\top P_{3,1})^+ = I$, and thus

$$(U^\top P_{3,x,1})(U^\top P_{3,1})^+ = (U^\top O T) O_x (U^\top O T)^{-1}.$$

Since $O_x$ is diagonal, the eigenvalues of $(U^\top P_{3,x,1})(U^\top P_{3,1})^+$ are exactly the observation probabilities $O_{x,1}, \ldots, O_{x,m}$.

Define i.i.d. random variables $g_x \sim N(0, 1)$ for each $x$. It is shown in [Mossel and Roch, 2006] that the eigenvalues of

$$\sum_x g_x (U^\top P_{3,x,1})(U^\top P_{3,1})^+ = (U^\top O T) \left( \sum_x g_x O_x \right) (U^\top O T)^{-1}$$

will be separated with high probability (though the separation is roughly on the same order as the failure probability; this is the main source of instability with this method). Therefore an eigen-decomposition will recover the columns of $(U^\top O T)$ up to a diagonal scaling matrix $S$, i.e. $U^\top O T S$. Then for each $x$, we can diagonalize $(U^\top P_{3,x,1})(U^\top P_{3,1})^+$:

$$(U^\top O T S)^{-1} (U^\top P_{3,x,1})(U^\top P_{3,1})^+ (U^\top O T S) = O_x.$$

Now we can form $O$ from the diagonals of $O_x$. Since $O$ has full column rank, $O^+ O = I_m$, so it is now easy to also recover $\pi$ and $T$ from $P_1$ and $P_{2,1}$:

$$O^+ P_1 = O^+ O \pi = \pi$$

and

$$O^+ P_{2,1} (O^+)^\top \mathrm{diag}(\pi)^{-1} = O^+ (O T\, \mathrm{diag}(\pi) O^\top)(O^+)^\top \mathrm{diag}(\pi)^{-1} = T.$$
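The recovery procedure sketched above can be exercised end to end on exact moments. The following is a minimal sketch, assuming NumPy and population (noise-free) matrices $P_1$, $P_{2,1}$, $P_{3,x,1}$; the function names are hypothetical, and the recovered states come out in an arbitrary order determined by the eigen-decomposition.

```python
import numpy as np

def hmm_moments(T, O, pi):
    """Exact P_1, P_{2,1} and the stack of P_{3,x,1} (x = 0..n-1) for an HMM
    with T[i, j] = Pr[h' = i | h = j] and O[x, j] = Pr[x | h = j]."""
    n = O.shape[0]
    P1 = O @ pi
    P21 = O @ T @ np.diag(pi) @ O.T
    P3x1 = np.array([O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T
                     for x in range(n)])
    return P1, P21, P3x1

def recover_observation_matrix(P21, P3x1, m, seed=0):
    """Recover O (up to a permutation of its columns) via the random
    combination of the matrices (U^T P_{3,x,1})(U^T P_{3,1})^+."""
    n = P21.shape[0]
    U = np.linalg.svd(P21)[0][:, :m]
    P31 = P3x1.sum(axis=0)
    right = np.linalg.pinv(U.T @ P31)
    M = [U.T @ P3x1[x] @ right for x in range(n)]   # each is similar to O_x
    g = np.random.default_rng(seed).standard_normal(n)
    mixed = sum(g[x] * M[x] for x in range(n))
    _, R = np.linalg.eig(mixed)                     # columns ~ U^T O T S
    R_inv = np.linalg.inv(R)
    return np.array([np.real(np.diag(R_inv @ M[x] @ R)) for x in range(n)])

def recover_pi_and_T(O_hat, P1, P21):
    """Recover pi and T from O_hat, in the same state order as O_hat's columns."""
    O_pinv = np.linalg.pinv(O_hat)
    pi_hat = O_pinv @ P1
    T_hat = O_pinv @ P21 @ O_pinv.T @ np.diag(1.0 / pi_hat)
    return pi_hat, T_hat
```

With empirical moments in place of exact ones, the eigenvector step is the fragile part, as the text notes: its accuracy depends on the gaps between the eigenvalues of the randomly mixed matrix.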
Note that because [Mossel and Roch, 2006] do not allow more observations than states, they do not need to work in a lower dimensional subspace such as range($U$). Thus, they perform an eigen-decomposition of the matrix

$$\sum_x g_x P_{3,x,1} P_{3,1}^{-1} = (O T) \left( \sum_x g_x O_x \right) (O T)^{-1},$$

and then use the eigenvectors to form the matrix $OT$. Thus they rely on the stability of the eigenvectors, which depends heavily on the spacing of the eigenvalues.